Abstract


In this paper we study boosting methods from a new perspective. We build on recent work by Efron
et al. to show that boosting approximately (and in some cases exactly) minimizes its loss criterion
with an $\ell_1$ constraint on the coefficient vector. This helps explain the success of boosting with
early stopping as regularized fitting of the loss criterion. For the two most commonly used criteria
(exponential and binomial log-likelihood), we further show that as the constraint is relaxed (or,
equivalently, as the boosting iterations proceed) the solution converges, in the separable case, to an
“$\ell_1$-optimal” separating hyper-plane. We prove that this $\ell_1$-optimal separating hyper-plane has the
property of maximizing the minimal $\ell_1$-margin of the training data, as defined in the boosting
literature. An interesting fundamental similarity between boosting and kernel support vector machines
emerges, as both can be described as methods for regularized optimization in high-dimensional 
predictor space, using a computational trick to make the calculation practical, and converging to 
margin-maximizing solutions. While this statement describes SVMs exactly, it applies to boosting 
only approximately. 

Keywords: boosting, regularized optimization, support vector machines, margin maximization 


1. Introduction and Outline 


Boosting is a method for iteratively building an additive model

$$F_T(x) = \sum_{t=1}^{T} \alpha_t h_{j_t}(x), \tag{1}$$

where $h_{j_t} \in \mathcal{H}$, a large (but we will assume finite) dictionary of candidate predictors or “weak
learners”, and $h_{j_t}$ is the basis function selected as the “best candidate” to modify the function at
stage $t$. The model $F_T$ can equivalently be represented by assigning a coefficient to each dictionary
function $h_j \in \mathcal{H}$ rather than to the selected $h_{j_t}$'s only:

$$F_T(x) = \sum_{j=1}^{J} h_j(x)\, \beta_j^{(T)}, \tag{2}$$

where $J = |\mathcal{H}|$ and $\beta_j^{(T)} = \sum_{t:\, j_t = j} \alpha_t$. The “$\beta$” representation allows us to interpret the coefficient
vector $\beta^{(T)}$ as a vector in $\mathbb{R}^J$ or, equivalently, as the hyper-plane which has $\beta^{(T)}$ as its normal. This
interpretation will play a key role in our exposition.
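
The equivalence of the two representations can be checked on a toy example. The following is a minimal sketch, not part of the original exposition: the function names, the 0-based indexing of the dictionary, and the numbers are all illustrative assumptions.

```python
def alphas_to_beta(selected, alphas, J):
    """Collapse stage-wise steps into one coefficient per dictionary function.

    selected[t] is the index j_t chosen at stage t (0-based here),
    alphas[t] is the step alpha_t taken at that stage.
    """
    beta = [0.0] * J
    for j_t, a_t in zip(selected, alphas):
        beta[j_t] += a_t
    return beta


def predict_stagewise(selected, alphas, h_x):
    # Representation (1): sum over stages of alpha_t * h_{j_t}(x)
    return sum(a * h_x[j] for j, a in zip(selected, alphas))


def predict_beta(beta, h_x):
    # Representation (2): sum over the dictionary of h_j(x) * beta_j
    return sum(b * h for b, h in zip(beta, h_x))


# Hypothetical run: 4 stages over a dictionary of J = 3 functions.
selected = [2, 0, 2, 1]
alphas = [0.5, 0.25, 0.5, -0.125]
h_x = [1.0, -1.0, 1.0]          # values h_1(x), h_2(x), h_3(x) at one point x
beta = alphas_to_beta(selected, alphas, J=3)
```

Both `predict_stagewise(selected, alphas, h_x)` and `predict_beta(beta, h_x)` evaluate the same model at `x`; only the bookkeeping differs.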

Some examples of common dictionaries are: 


• The training variables themselves, in which case $h_j(x) = x_j$. This leads to our “additive”
model $F_T$ being just a linear model in the original data. The number of dictionary functions
will be $J = d$, the dimension of $x$.

• Polynomial dictionary of degree $p$, in which case the number of dictionary functions will be
$J = \binom{p+d}{d}$.


• Decision trees with up to $k$ terminal nodes, if we limit the split points to data points (or mid-way
between data points, as CART does). The number of possible trees is bounded from
above (trivially) by $J \le (np)^k \cdot 2^{k^2}$. Note that regression trees do not fit into our framework,
since they will give $J = \infty$.


The boosting idea was first introduced by Freund and Schapire (1995), with their AdaBoost 
algorithm. AdaBoost and other boosting algorithms have attracted a lot of attention due to their great 
success in data modeling tasks, and the “mechanism” which makes them work has been presented 
and analyzed from several perspectives. Friedman et al. (2000) develop a statistical perspective, 
which ultimately leads to viewing AdaBoost as a gradient-based incremental search for a good 
additive model (more specifically, it is a “coordinate descent” algorithm), using the exponential loss 
function $C(y,F) = \exp(-yF)$, where $y \in \{-1,1\}$. The gradient boosting (Friedman, 2001) and
anyboost (Mason et al., 1999) generic algorithms have used this approach to generalize the boosting
idea to wider families of problems and loss functions. In particular, Friedman et al. (2000) have
pointed out that the binomial log-likelihood loss $C(y,F) = \log(1 + \exp(-yF))$ is a more natural
loss for classification, and is more robust to outliers and misspecified data. 

A different analysis of boosting, originating in the machine learning community, concentrates on 
the effect of boosting on the margins $y_i F(x_i)$. For example, Schapire et al. (1998) use margin-based
arguments to prove convergence of boosting to perfect classification performance on the training 
data under general conditions, and to derive bounds on the generalization error (on future, unseen 
data). 

In this paper we combine the two approaches, to conclude that gradient-based boosting can be 
described, in the separable case, as an approximate margin maximizing process. The view we de- 
velop of boosting as an approximate path of optimal solutions to regularized problems also justifies 
early stopping in boosting as specifying a value for the “regularization parameter”.

We consider the problem of minimizing non-negative convex loss functions (in particular the
exponential and binomial log-likelihood loss functions) over the training data, with an $\ell_1$ bound on
the model coefficients:

$$\hat{\beta}(c) = \arg\min_{\|\beta\|_1 \le c} \sum_i C(y_i, h(x_i)'\beta), \tag{3}$$

where $h(x_i) = [h_1(x_i), h_2(x_i), \ldots, h_J(x_i)]'$ and $J = |\mathcal{H}|$.$^1$

Figure 1: Exact coefficient paths (left) for $\ell_1$-constrained squared error regression and “boosting”
coefficient paths (right) on the data from a prostate cancer study.

Hastie et al. (2001, Chapter 10) have observed that “slow” gradient-based boosting (i.e., we set
$\alpha_t = \epsilon, \forall t$ in (1), with $\epsilon$ small) tends to follow the penalized path $\hat{\beta}(c)$ as a function of $c$, under
some mild conditions on this path. In other words, using the notation of (2), (3), this implies that
$\|\beta^{(c/\epsilon)} - \hat{\beta}(c)\|$ vanishes with $\epsilon$, for all (or a wide range of) values of $c$. Figure 1 illustrates this
equivalence between $\epsilon$-boosting and the optimal solution of (3) on a real-life data set, using squared
error loss as the loss function. In this paper we demonstrate this equivalence further and formally
state it as a conjecture. Some progress towards proving this conjecture has been made by Efron et al.
(2004), who prove a weaker “local” result for the case where $C$ is squared error loss, under some
mild conditions on the optimal path. We generalize their result to general convex loss functions.
Combining the empirical and theoretical evidence, we conclude that boosting can be viewed as
an approximate incremental method for following the $\ell_1$-regularized path.
We then prove that in the separable case, for both the exponential and logistic log-likelihood
loss functions, $\hat{\beta}(c)/c$ converges as $c \to \infty$ to an “optimal” separating hyper-plane $\hat{\beta}$ described by

$$\hat{\beta} = \arg\max_{\|\beta\|_1 = 1} \min_i y_i \beta' h(x_i). \tag{4}$$

In other words, $\hat{\beta}$ maximizes the minimal margin among all vectors with $\ell_1$-norm equal to 1.$^2$ This
result generalizes easily to other $\ell_p$-norm constraints. For example, if $p = 2$, then $\hat{\beta}$ describes the
optimal separating hyper-plane in the Euclidean sense, i.e., the same one that a non-regularized
support vector machine would find.

1. Our notation assumes that the minimum in (3) is unique, which requires some mild assumptions. To avoid notational
complications we use this slightly abusive notation throughout this paper. In Appendix B we give explicit conditions
for uniqueness of this minimum.

2. The margin maximizing hyper-plane in (4) may not be unique, and we show that in that case the limit $\hat{\beta}$ is still defined
and it also maximizes the second minimal margin. See Appendix B.2 for details.

Combining our two main results, we get the following characterization of boosting:

$\epsilon$-Boosting can be described as a gradient-descent search, approximately following the
path of $\ell_1$-constrained optimal solutions to its loss criterion, and converging, in the
separable case, to a “margin maximizer” in the $\ell_1$ sense.


Note that boosting with a large dictionary $\mathcal{H}$ (in particular if $n < J = |\mathcal{H}|$) guarantees that the data
will be separable (except for pathologies), hence separability is a very mild assumption here.


As in the case of support vector machines in high dimensional feature spaces, the non-regularized 
“optimal” separating hyper-plane is usually of theoretical interest only, since it typically represents 
an over-fitted model. Thus, we would want to choose a good regularized model. Our results indicate 
that Boosting gives a natural method for doing that, by “stopping early” in the boosting process. Fur- 
thermore, they point out the fundamental similarity between Boosting and SVMs: both approaches 
allow us to fit regularized models in high-dimensional predictor space, using a computational trick. 
They differ in the regularization approach they take (exact $\ell_2$ regularization for SVMs, approximate
$\ell_1$ regularization for Boosting) and in the computational trick that facilitates fitting (the “kernel”
trick for SVMs, coordinate descent for Boosting).


1.1 Related Work 


Schapire et al. (1998) have identified the normalized margins as distance from an $\ell_1$-normed
separating hyper-plane. Their results relate the boosting iterations’ success to the minimal margin of
the combined model. Rätsch et al. (2001b) take this further using an asymptotic analysis of
AdaBoost. They prove that the “normalized” minimal margin, $\min_i y_i \sum_t \alpha_t h_t(x_i) / \sum_t |\alpha_t|$, is
asymptotically equal for both classes. In other words, they prove that the asymptotic separating hyper-plane is
equally far away from the closest points on either side. This is a property of the margin maximizing
separating hyper-plane as we define it. Both papers also illustrate the margin maximizing effects of 
AdaBoost through experimentation. However, they both stop short of proving the convergence to 
optimal (margin maximizing) solutions. 


Motivated by our result, Rätsch and Warmuth (2002) have recently asserted the margin-maximizing 
properties of $\epsilon$-AdaBoost, using a different approach than the one used in this paper. Their results
relate only to the asymptotic convergence of infinitesimal AdaBoost, compared to our analysis of 
the “regularized path” traced along the way and of a variety of boosting loss functions, which also 
leads to a convergence result on binomial log-likelihood loss. 


The convergence of boosting to an “optimal” solution from a loss function perspective has been 
analyzed in several papers. Rätsch et al. (2001a) and Collins et al. (2000) give results and bounds on
the convergence of training-set loss, $\sum_i C(y_i, \sum_t \alpha_t h_t(x_i))$, to its minimum. However, in the separable
case convergence of the loss to 0 is inherently different from convergence of the linear separator to
the optimal separator. Any solution which separates the two classes perfectly can drive the
exponential (or log-likelihood) loss to 0, simply by scaling coefficients up linearly.


Two recent papers have made the connection between boosting and $\ell_1$ regularization in a slightly
different context than this paper. Zhang (2003) suggests a “shrinkage” version of boosting which
converges to $\ell_1$ regularized solutions, while Zhang and Yu (2003) illustrate the quantitative
relationship between early stopping in boosting and $\ell_1$ constraints.

2. Boosting as Gradient Descent


Generic gradient-based boosting algorithms (Friedman, 2001; Mason et al., 1999) attempt to find a 
good linear combination of the members of some dictionary of basis functions to optimize a given 
loss function over a sample. This is done by searching, at each iteration, for the basis function which 
gives the “steepest descent” in the loss, and changing its coefficient accordingly. In other words, 
this is a “coordinate descent” algorithm in $\mathbb{R}^J$, where we assign one dimension (or coordinate) for
the coefficient of each dictionary function. 

Assume we have data $\{x_i, y_i\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, a loss (or cost) function $C(y,F)$, and a set of
dictionary functions $\{h_j(x)\} : \mathbb{R}^d \to \mathbb{R}$. Then all of these algorithms follow the same essential
steps:


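Algorithm 1 Generic gradient-based boosting algorithm
1. Set $\beta^{(0)} = 0$.
2. For $t = 1 : T$,

(a) Let $F_i = \beta^{(t-1)\prime} h(x_i)$, $i = 1, \ldots, n$ (the current fit).

(b) Set $w_i = \frac{\partial C(y_i, F_i)}{\partial F_i}$, $i = 1, \ldots, n$.

(c) Identify $j_t = \arg\max_j |\sum_i w_i h_j(x_i)|$.

(d) Set $\beta_{j_t}^{(t)} = \beta_{j_t}^{(t-1)} - \alpha_t\, \mathrm{sign}(\sum_i w_i h_{j_t}(x_i))$ and $\beta_k^{(t)} = \beta_k^{(t-1)}$, $k \ne j_t$.

Here $\beta^{(t)}$ is the “current” coefficient vector and $\alpha_t > 0$ is the current step size. Notice that
$\sum_i w_i h_{j_t}(x_i) = \frac{\partial \sum_i C(y_i, F_i)}{\partial \beta_{j_t}}$.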
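
The steps of Algorithm 1 can be sketched in a few lines of code. This is a minimal illustration under several assumptions that are not in the text: the dictionary is small enough to store as an explicit matrix `H` with `H[i][j]` holding $h_j(x_i)$, the step size is held constant, and the function names and toy data are hypothetical.

```python
import math


def exp_loss_grad(y, F):
    # Derivative of exp(-y F) with respect to F
    return -y * math.exp(-y * F)


def boost(H, y, loss_grad, T, step):
    """Algorithm 1 with a constant step size alpha_t = step.

    H[i][j] holds h_j(x_i); y[i] is the response of observation i.
    Returns the coefficient vector beta^(T).
    """
    n, J = len(H), len(H[0])
    beta = [0.0] * J                                                       # step 1
    for _ in range(T):                                                     # step 2
        F = [sum(H[i][j] * beta[j] for j in range(J)) for i in range(n)]   # (a)
        w = [loss_grad(y[i], F[i]) for i in range(n)]                      # (b)
        # g[j] = sum_i w_i h_j(x_i), the partial derivative of the loss in beta_j
        g = [sum(w[i] * H[i][j] for i in range(n)) for j in range(J)]
        j_t = max(range(J), key=lambda j: abs(g[j]))                       # (c)
        # (d): move against the gradient; a zero gradient is treated as positive
        beta[j_t] -= step * math.copysign(1.0, g[j_t])
    return beta


# Toy data: the first dictionary function separates the classes, the second is noise.
H = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [1, 1, -1, -1]
beta = boost(H, y, exp_loss_grad, T=10, step=0.1)
```

On this toy data every step goes to the first coordinate, so `beta` ends near `[1.0, 0.0]`. The exhaustive search in step (c) is only feasible because the dictionary here has two members.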

As we mentioned, Algorithm 1 can be interpreted simply as a coordinate descent algorithm in
“weak learner” space. Implementation details include the dictionary $\mathcal{H}$ of “weak learners”, the loss
function $C(y,F)$, the method of searching for the optimal $j_t$ and the way in which $\alpha_t$ is determined.
For example, the original AdaBoost algorithm uses this scheme with the exponential loss
$C(y,F) = \exp(-yF)$, and an implicit line search to find the best $\alpha_t$ once a “direction” $j_t$ has been chosen (see
Hastie et al., 2001; Mason et al., 1999). The dictionary used by AdaBoost in this formulation would
be a set of candidate classifiers, i.e., $h_j(x_i) \in \{-1, +1\}$; usually decision trees are used in practice.


2.1 Practical Implementation of Boosting 


The dictionaries used for boosting are typically very large—practically infinite—and therefore the 
generic boosting algorithm we have presented cannot be implemented verbatim. In particular, it is 
not practical to exhaustively search for the maximizer in step 2(c). Instead, an approximate, usually 
greedy search is conducted to find a “good” candidate weak learner $h_{j_t}$ which makes the first order
decline in the loss large (even if not maximal among all possible models). 

3. The sign of $\alpha_t$ will always be $-\mathrm{sign}(\sum_i w_i h_{j_t}(x_i))$, since we want the loss to be reduced. In most cases, the dictionary
$\mathcal{H}$ is negation closed, and so it can be assumed WLOG that the coefficients are always positive and increasing.

In the common case that the dictionary of weak learners is comprised of decision trees with
up to $k$ nodes, the way AdaBoost and other boosting algorithms solve stage 2(c) is by building a
decision tree to a re-weighted version of the data, with the weights $|w_i|$. Thus they first replace step
2(c) with minimization of

$$\sum_i |w_i| \, 1\{y_i \ne h_{j_t}(x_i)\},$$


which is easily shown to be equivalent to the original step 2(c). They then use a greedy decision- 
tree building algorithm such as CART or C5 to build a k-node decision tree which minimizes this 
quantity, i.e., achieves low “weighted misclassification error” on the weighted data. Since the tree is 
built greedily—one split at a time—it will not be the global minimizer of weighted misclassification 
error among all k-node decision trees. However, it will be a good fit for the re-weighted data, and 
can be considered an approximation to the optimal tree. 
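
The equivalence between step 2(c) and weighted misclassification error rests on a one-line identity, which the sketch below checks numerically. It assumes classifiers taking values in $\{-1,+1\}$ and weights with $\mathrm{sign}(w_i) = -y_i$ (which holds for the exponential and binomial losses); under those assumptions $\sum_i w_i h(x_i) = 2\sum_i |w_i| 1\{y_i \ne h(x_i)\} - \sum_i |w_i|$. The numbers and function names are hypothetical.

```python
def weighted_error(w, y, h):
    # sum_i |w_i| * 1{y_i != h(x_i)}
    return sum(abs(wi) for wi, yi, hi in zip(w, y, h) if yi != hi)


def gradient_component(w, h):
    # sum_i w_i h(x_i), the quantity whose absolute value is maximized in step 2(c)
    return sum(wi * hi for wi, hi in zip(w, h))


y = [1, 1, -1, -1, 1]
w = [-0.5, -0.25, 1.0, 0.125, -2.0]   # sign(w_i) = -y_i
h = [1, -1, -1, 1, 1]                 # predictions of one hypothetical classifier

total = sum(abs(wi) for wi in w)
lhs = gradient_component(w, h)
rhs = 2 * weighted_error(w, y, h) - total
```

Since `total` does not depend on the classifier, a low weighted error makes `lhs` strongly negative. With a negation-closed dictionary, maximizing the absolute value in step 2(c) therefore picks the same weak learner (up to sign) as minimizing the weighted error.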

This use of approximate optimization techniques is critical, since much of the strength of the 
boosting approach comes from its ability to build additive models in very high-dimensional predic- 
tor spaces. In such spaces, standard exact optimization techniques are impractical: any approach 
which requires calculation and inversion of Hessian matrices is completely out of the question, 
and even approaches which require only first derivatives, such as coordinate descent, can only be 
implemented approximately. 


2.2 Gradient-Based Boosting as a Generic Modeling Tool 


As Friedman (2001) and Mason et al. (1999) mention, this view of boosting as gradient descent allows
us to devise boosting algorithms for any function estimation problem—all we need is an appro- 
priate loss and an appropriate dictionary of “weak learners”. For example, Friedman et al. (2000) 
suggested using the binomial log-likelihood loss instead of the exponential loss of AdaBoost for 
binary classification, resulting in the LogitBoost algorithm. However, there is no need to limit 
boosting algorithms to classification—Friedman (2001) applied this methodology to regression es- 
timation, using squared error loss and regression trees, and Rosset and Segal (2003) applied it to 
density estimation, using the log-likelihood criterion and Bayesian networks as weak learners. Their 
experiments and those of others illustrate that the practical usefulness of this approach—coordinate 
descent in high dimensional predictor space—carries beyond classification, and even beyond super- 
vised learning. 

The view we present in this paper, of coordinate-descent boosting as approximate $\ell_1$-regularized
fitting, offers some insight into why this approach would be good in general: it allows us to fit
regularized models directly in high dimensional predictor space. In this it bears a conceptual similarity
to support vector machines, which exactly fit an $\ell_2$ regularized model in high dimensional (RKH)
predictor space.


2.3 Loss Functions 


The two most commonly used loss functions for boosting classification models are the exponential 
and the (minus) binomial log-likelihood: 


Exponential:   $C_e(y,F) = \exp(-yF)$;
Loglikelihood: $C_l(y,F) = \log(1 + \exp(-yF))$.


Figure 2: The two classification loss functions.

These two loss functions bear some important similarities to each other. As Friedman et al. (2000)
show, the population minimizer of expected loss at point $x$ is similar for both loss functions and is
given by

$$\hat{F}(x) = c \cdot \log\left[\frac{P(y = 1 \mid x)}{P(y = -1 \mid x)}\right],$$

where $c_e = 1/2$ for exponential loss and $c_l = 1$ for binomial loss.

More importantly for our purpose, we have the following simple proposition, which illustrates
the strong similarity between the two loss functions for positive margins (i.e., correct
classifications):

Proposition 1
$$yF \ge 0 \;\Rightarrow\; 0.5\, C_e(y,F) \le C_l(y,F) \le C_e(y,F). \tag{5}$$


In other words, the two losses become similar if the margins are positive, and both behave like 
exponentials. 


Proof. Consider the functions $f_1(z) = z$ and $f_2(z) = \log(1+z)$ for $z \in [0,1]$. Then $f_1(0) = f_2(0) = 0$,
and

$$\frac{d f_1(z)}{dz} \equiv 1, \qquad \frac{1}{2} \le \frac{d f_2(z)}{dz} = \frac{1}{1+z} \le 1.$$

Thus we can conclude $0.5 f_1(z) \le f_2(z) \le f_1(z)$. Now set $z = \exp(-yF)$ and we get the desired
result. ∎
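
Proposition 1 is easy to check numerically. The sketch below evaluates both losses on a grid of non-negative margins and then at one negative margin, where the lower bound no longer holds. The grid and the function names are illustrative choices.

```python
import math


def exp_loss(y, F):
    return math.exp(-y * F)


def log_loss(y, F):
    # log(1 + exp(-y F)), computed with log1p for accuracy at large margins
    return math.log1p(math.exp(-y * F))


# Margins yF from 0 to 10 in steps of 0.1 (y is fixed at +1, so F is the margin).
margins = [k / 10 for k in range(0, 101)]
ok = all(
    0.5 * exp_loss(1, m) <= log_loss(1, m) <= exp_loss(1, m)
    for m in margins
)

# At a negative margin the exponential loss grows much faster than the binomial
# loss, so the lower bound in (5) fails.
violated = 0.5 * exp_loss(1, -2.0) > log_loss(1, -2.0)
```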


For negative margins the behaviors of $C_e$ and $C_l$ are very different, as Friedman et al. (2000)
have noted. In particular, $C_l$ is more robust against outliers and misspecified data.


2.4 Line-Search Boosting vs. $\epsilon$-Boosting


As mentioned above, AdaBoost determines $\alpha_t$ using a line search. In our notation for Algorithm 1
this would be

$$\alpha_t = \arg\min_{\alpha} \sum_i C(y_i, F_i + \alpha h_{j_t}(x_i)).$$
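
For the exponential loss and a weak learner taking values in $\{-1,+1\}$, this line search has a well-known closed form: the loss along the direction is $W_{+} e^{-\alpha} + W_{-} e^{\alpha}$, where $W_{+}$ and $W_{-}$ are the total weights $\exp(-y_i F_i)$ of the correctly and incorrectly classified observations, so the minimizer is $\frac{1}{2}\log(W_{+}/W_{-})$. The sketch below assumes that setting, treats $\alpha$ as the signed coefficient added to the chosen weak learner, and requires $W_{-} > 0$. The names and data are hypothetical.

```python
import math


def line_search_step(F, y, h):
    """Minimizer over a of sum_i exp(-y_i (F_i + a h_i)), for h_i in {-1, +1}."""
    w = [math.exp(-yi * Fi) for yi, Fi in zip(y, F)]
    right = sum(wi for wi, yi, hi in zip(w, y, h) if yi == hi)
    wrong = sum(wi for wi, yi, hi in zip(w, y, h) if yi != hi)
    return 0.5 * math.log(right / wrong)   # assumes wrong > 0


def exp_loss_along(F, y, h, a):
    return sum(math.exp(-yi * (Fi + a * hi)) for yi, Fi, hi in zip(y, F, h))


F = [0.0, 0.0, 0.0, 0.0]      # current fit
y = [1, 1, -1, -1]
h = [1, 1, -1, 1]             # weak learner that misclassifies the last point
a = line_search_step(F, y, h)
```

Here three of four equally weighted points are classified correctly, so `a` equals half the logarithm of 3.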

The alternative approach, suggested by Friedman (2001) and Hastie et al. (2001), is to “shrink” all $\alpha_t$
to a single small value $\epsilon$. This may slow down learning considerably (depending on how small $\epsilon$
is), but is attractive theoretically: the first-order theory underlying gradient boosting implies that
the weak learner chosen is the best increment only “locally”. It can also be argued that this
approach is “stronger” than line search, as we can keep selecting the same $h_{j_t}$ repeatedly if it remains
optimal, and so $\epsilon$-boosting dominates line-search boosting in terms of training error. In practice,
this approach of “slowing the learning rate” usually performs better than line-search in terms of
prediction error as well (see Friedman, 2001). For our purposes, we will mostly assume $\epsilon$ is
infinitesimally small, so the theoretical boosting algorithm which results is the “limit” of a series of
boosting algorithms with shrinking $\epsilon$.

In regression terminology, the line-search version is equivalent to forward stage-wise modeling, 
infamous in the statistics literature for being too greedy and highly unstable (see Friedman, 2001). 
This is intuitively obvious, since by increasing the coefficient until it saturates we are destroying 
“signal” which may help us select other good predictors. 


3. $\ell_p$ Margins, Support Vector Machines and Boosting


We now introduce the concept of margins as a geometric interpretation of a binary classification 
model. In the context of boosting, this view offers a different understanding of AdaBoost from the 
gradient descent view presented above. In the following sections we connect the two views. 


3.1 The Euclidean Margin and the Support Vector Machine 


Consider a classification model in high dimensional predictor space: $F(x) = \sum_j h_j(x)\beta_j$. We say
that the model separates the training data $\{x_i, y_i\}_{i=1}^n$ if $\mathrm{sign}(F(x_i)) = y_i, \forall i$. From a geometrical
perspective this means that the hyper-plane defined by $F(x) = 0$ is a separating hyper-plane for this
data, and we define its (Euclidean) margin as

$$m_2(\beta) = \min_i \frac{y_i F(x_i)}{\|\beta\|_2}. \tag{6}$$

The margin-maximizing separating hyper-plane for this data would be defined by the $\beta$ which
maximizes $m_2(\beta)$. Figure 3 shows a simple example of separable data in two dimensions, with its
margin-maximizing separating hyper-plane. The Euclidean margin-maximizing separating
hyper-plane is the (non regularized) support vector machine solution. Its margin maximizing properties
play a central role in deriving generalization error bounds for these models, and form the basis for
a rich literature.


3.2 The $\ell_1$ Margin and Its Relation to Boosting


Instead of considering the Euclidean margin as in (6) we can define an “$\ell_p$ margin” concept as

$$m_p(\beta) = \min_i \frac{y_i F(x_i)}{\|\beta\|_p}. \tag{7}$$





Figure 3: A simple data example, with two observations from class “O” and two observations from
class “X”. The full line is the Euclidean margin-maximizing separating hyper-plane.

Figure 4: $\ell_1$ margin maximizing separating hyper-plane for the same data set as Figure 3. The
difference between the diagonal Euclidean optimal separator and the vertical $\ell_1$ optimal
separator illustrates the “sparsity” effect of optimal $\ell_1$ separation.

Of particular interest to us is the case $p = 1$. Figure 4 shows the $\ell_1$ margin maximizing separating
hyper-plane for the same simple example as Figure 3. Note the fundamental difference between
the two solutions: the $\ell_2$-optimal separator is diagonal, while the $\ell_1$-optimal one is vertical. To
understand why this is so we can relate the two margin definitions to each other as

$$\frac{yF(x)}{\|\beta\|_1} = \frac{yF(x)}{\|\beta\|_2} \cdot \frac{\|\beta\|_2}{\|\beta\|_1}. \tag{8}$$

From this representation we can observe that the $\ell_1$ margin will tend to be big if the ratio $\|\beta\|_2 / \|\beta\|_1$ is
big. This ratio will generally be big if $\beta$ is sparse. To see this, consider fixing the $\ell_1$ norm of the
vector and then comparing the $\ell_2$ norm of two candidates: one with many small components and the
other, a sparse one, with a few large components and many zero components. It is easy to see that
the second vector will have bigger $\ell_2$ norm, and hence (if the $\ell_2$ margin for both vectors is equal) a
bigger $\ell_1$ margin.
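
The relation (8) and the effect of sparsity on the norm ratio can be illustrated with a short computation. This is a sketch on made-up data (not the data of Figures 3 and 4); the function names and the two candidate vectors are assumptions chosen so that both have $\ell_1$ norm 1.

```python
def norm(beta, p):
    return sum(abs(b) ** p for b in beta) ** (1.0 / p)


def margin(beta, H, y, p):
    # min_i y_i F(x_i) / ||beta||_p, with F(x_i) = sum_j h_j(x_i) beta_j
    F = [sum(h * b for h, b in zip(row, beta)) for row in H]
    return min(yi * Fi for yi, Fi in zip(y, F)) / norm(beta, p)


H = [[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]]   # hypothetical h(x_i)
y = [1, 1, -1, -1]

dense = [0.5, 0.5]     # l1 norm 1, spread over both coordinates
sparse = [1.0, 0.0]    # l1 norm 1, concentrated on one coordinate

ratio_dense = norm(dense, 2) / norm(dense, 1)
ratio_sparse = norm(sparse, 2) / norm(sparse, 1)

# Relation (8): the l1 margin is the l2 margin times the norm ratio.
m1 = margin(dense, H, y, 1)
m1_via_8 = margin(dense, H, y, 2) * ratio_dense
```

The sparse vector attains the largest possible ratio, 1, while the dense one has ratio $1/\sqrt{2}$. Whether the sparse vector also wins on $\ell_1$ margin depends on the $\ell_2$ margins, which differ between the two candidates on this data.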

A different perspective on the difference between the optimal solutions is given by a theorem
due to Mangasarian (1999), which states that the $\ell_p$ margin maximizing separating hyper-plane
maximizes the $\ell_q$ distance from the closest points to the separating hyper-plane, with $\frac{1}{p} + \frac{1}{q} = 1$.
Thus the Euclidean optimal separator ($p = 2$) also maximizes Euclidean distance between the points
and the hyper-plane, while the $\ell_1$ optimal separator maximizes $\ell_\infty$ distance. This interesting result
gives another intuition why $\ell_1$ optimal separating hyper-planes tend to be coordinate-oriented (i.e.,
have sparse representations): since $\ell_\infty$ projection considers only the largest coordinate distance,
some coordinate distances may be 0 at no cost of decreased $\ell_\infty$ distance.

Schapire et al. (1998) have pointed out the relation between AdaBoost and the $\ell_1$ margin. They
prove that, in the case of separable data, the boosting iterations increase the “boosting” margin of
the model, defined as

$$\min_i \frac{y_i \sum_t \alpha_t h_{j_t}(x_i)}{\|\alpha\|_1}. \tag{9}$$


In other words, this is the $\ell_1$ margin of the model, except that it uses the $\alpha$ incremental representation
rather than the $\beta$ “geometric” representation for the model. The two representations give the same
$\ell_1$ norm if there is sign consistency, or “monotonicity”, in the coefficient paths traced by the model,
i.e., if at every iteration $t$ of the boosting algorithm

$$\beta_{j_t} \ne 0 \;\Rightarrow\; \mathrm{sign}(\alpha_t) = \mathrm{sign}(\beta_{j_t}). \tag{10}$$

As we will see later, this monotonicity condition will play an important role in the equivalence
between boosting and $\ell_1$ regularization.
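
A small example makes the role of condition (10) concrete. In this sketch the steps $\alpha_t$ are treated as signed increments (as in (10), unlike the positive step size of Algorithm 1); the function names and the two hypothetical step sequences are illustrative assumptions.

```python
def l1(v):
    return sum(abs(x) for x in v)


def beta_from_steps(selected, alphas, J):
    beta = [0.0] * J
    for j, a in zip(selected, alphas):
        beta[j] += a
    return beta


def is_monotone(selected, alphas, J):
    # Condition (10): a step on a nonzero coefficient must share its sign.
    beta = [0.0] * J
    for j, a in zip(selected, alphas):
        if beta[j] != 0 and (a > 0) != (beta[j] > 0):
            return False
        beta[j] += a
    return True


selected = [0, 1, 0]
monotone_steps = [0.5, -0.25, 0.5]      # coordinate 0 keeps growing
reversing_steps = [0.5, -0.25, -0.25]   # coordinate 0 is later pulled back

beta_mono = beta_from_steps(selected, monotone_steps, J=2)
beta_rev = beta_from_steps(selected, reversing_steps, J=2)
```

For the monotone sequence the two norms agree (both 1.25). For the reversing sequence the $\alpha$ norm is 1.0 while the $\beta$ norm is only 0.5, so the boosting margin (9) understates the $\ell_1$ margin of the same model.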

The $\ell_1$-margin maximization view of AdaBoost presented by Schapire et al. (1998), and the
many papers that followed, is important for the analysis of boosting algorithms for
two distinct reasons:

• It gives an intuitive, geometric interpretation of the model that AdaBoost is looking for: a
model which separates the data well in this $\ell_1$-margin sense. Note that the view of boosting as
gradient descent in a loss criterion doesn’t really give the same kind of intuition: if the data
is separable, then any model which separates the training data will drive the exponential or
binomial loss to 0 when scaled up:

$$m(\beta) > 0 \;\Rightarrow\; \sum_i C(y_i, d\,\beta' x_i) \to 0 \text{ as } d \to \infty.$$

• The $\ell_1$-margin behavior of a classification model on its training data facilitates generation
of generalization (or prediction) error bounds, similar to those that exist for support vector
machines (Schapire et al., 1998). The important quantity in this context is not the margin but
the “normalized” margin, which considers the “conjugate norm” of the predictor vectors:

$$\frac{y_i \beta' h(x_i)}{\|\beta\|_1 \|h(x_i)\|_\infty}.$$

When the dictionary we are using is comprised of classifiers then $\|h(x_i)\|_\infty = 1$ always and
thus the $\ell_1$ margin is exactly the relevant quantity. The error bounds described by Schapire
et al. (1998) allow using the whole $\ell_1$ margin distribution, not just the minimal margin.
However, boosting’s tendency to separate well in the $\ell_1$ sense is a central motivation behind their
results.
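
The first point above, that the loss alone cannot distinguish between separating models once they are scaled up, can be seen in a few lines. This is a sketch on hypothetical data with two separating directions of different minimal margin; the names are illustrative.

```python
import math


def total_exp_loss(beta, H, y, d):
    # sum_i exp(-y_i * d * beta'h(x_i)): exponential loss of the model scaled by d
    return sum(
        math.exp(-yi * d * sum(b * h for b, h in zip(beta, row)))
        for row, yi in zip(H, y)
    )


H = [[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]]
y = [1, 1, -1, -1]
wide = [0.5, 0.5]      # separates with minimal y_i * beta'h(x_i) equal to 1.5
narrow = [1.0, 0.0]    # separates with minimal y_i * beta'h(x_i) equal to 1.0

scales = (1, 10, 100)
losses_wide = [total_exp_loss(wide, H, y, d) for d in scales]
losses_narrow = [total_exp_loss(narrow, H, y, d) for d in scales]
```

Both sequences decrease towards 0 as the scale grows, even though the two directions have different margins. The margin affects only the rate of decay.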


From a statistical perspective, however, we should be suspicious of margin-maximization as a 
method for building good prediction models in high dimensional predictor space. Margin maxi- 
mization in high dimensional space is likely to lead to over-fitting and bad prediction performance. 
This has been observed in practice by many authors, in particular Breiman (1999). Our results in 
the next two sections suggest an explanation based on model complexity: margin maximization is 
the limit of parametric regularized optimization models, as the regularization vanishes, and the reg- 
ularized models along the path may well be superior to the margin maximizing “limiting” model, in 
terms of prediction performance. In Section 7 we return to discuss these issues in more detail. 


4. Boosting as Approximate Incremental $\ell_1$ Constrained Fitting


In this section we introduce an interpretation of the generic coordinate-descent boosting algorithm 
as tracking a path of approximate solutions to $\ell_1$-constrained (or equivalently, regularized) versions
of its loss criterion. This view serves our understanding of what boosting does, in particular the 
connection between early stopping in boosting and regularization. We will also use this view to 
get a result about the asymptotic margin-maximization of regularized classification models, and 
by analogy of classification boosting. We build on ideas first presented by Hastie et al. (2001, 
Chapter 10) and Efron et al. (2004). 

Given a convex non-negative loss criterion $C(\cdot,\cdot)$, consider the 1-dimensional path of optimal
solutions to $\ell_1$ constrained optimization problems over the training data:

$$\hat{\beta}(c) = \arg\min_{\|\beta\|_1 \le c} \sum_i C(y_i, h(x_i)'\beta). \tag{11}$$

As $c$ varies, we get that $\hat{\beta}(c)$ traces a 1-dimensional “optimal curve” through $\mathbb{R}^J$. If an optimal
solution for the non-constrained problem exists and has finite $\ell_1$ norm $c_0$, then obviously
$\hat{\beta}(c) = \hat{\beta}(c_0) = \hat{\beta}, \forall c > c_0$. In the case of separable 2-class data, using either $C_e$ or $C_l$, there is no
finite-norm optimal solution. Rather, the constrained solution will always have $\|\hat{\beta}(c)\|_1 = c$.

A different way of building a solution which has $\ell_1$ norm $c$ is to run our $\epsilon$-boosting algorithm
for $c/\epsilon$ iterations. This will give an $\alpha^{(c/\epsilon)}$ vector which has $\ell_1$ norm exactly $c$. For the norm of the
geometric representation $\beta^{(c/\epsilon)}$ to also be equal to $c$, we need the monotonicity condition (10) to
hold as well. This condition will play a key role in our exposition.

Figure 5: Another example of the equivalence between the Lasso optimal solution path (left) and
$\epsilon$-boosting with squared error loss. Note that the equivalence breaks down when the path
of variable 7 becomes non-monotone.

We are going to argue that the two solution paths $\hat{\beta}(c)$ and $\beta^{(c/\epsilon)}$ are very similar for $\epsilon$ “small”.
Let us start by observing this similarity in practice. Figure 1 in the introduction shows an example of
this similarity for squared error loss fitting with $\ell_1$ (lasso) penalty. Figure 5 shows another example
in the same mold, taken from Efron et al. (2004). The data is a diabetes study and the “dictionary”
used is just the original 10 variables. The panel on the left shows the path of optimal $\ell_1$-constrained
solutions $\hat{\beta}(c)$ and the panel on the right shows the $\epsilon$-boosting path with the 10-dimensional
dictionary (the total number of boosting iterations is about 6000). The 1-dimensional path through $\mathbb{R}^{10}$
is described by 10 coordinate curves, corresponding to each one of the variables. The interesting
phenomenon we observe is that the two coefficient traces are not completely identical. Rather, they
agree up to the point where the coefficient path of variable 7 becomes non-monotone, i.e., it violates (10)
(this point is where variable 8 comes into the model, see the arrow on the right panel). This example
illustrates that the monotonicity condition, and its implication that $\|\alpha\|_1 = \|\beta\|_1$, is critical for the
equivalence between $\epsilon$-boosting and $\ell_1$-constrained optimization.
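
The agreement between the two paths can be reproduced in the simplest possible setting. The sketch below is not the diabetes example: it assumes squared error loss $\frac{1}{2}\sum_i (y_i - \beta_i)^2$ with an identity dictionary ($h_j(x_i) = 1\{i = j\}$), for which the $\ell_1$-constrained solution is known in closed form as soft thresholding of $y$. The data, names and tolerance are illustrative.

```python
import math


def soft_threshold(y, lam):
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in y]


def lasso_constrained_identity(y, c):
    """l1-constrained least squares with identity design; assumes c < ||y||_1.

    Bisects on the threshold until the solution has l1 norm c.
    """
    lo, hi = 0.0, max(abs(v) for v in y)
    for _ in range(100):
        lam = 0.5 * (lo + hi)
        if sum(abs(b) for b in soft_threshold(y, lam)) > c:
            lo = lam
        else:
            hi = lam
    return soft_threshold(y, hi)


def eps_boost_identity(y, eps, steps):
    """epsilon-boosting (Algorithm 1 with constant step) for the same problem."""
    beta = [0.0] * len(y)
    for _ in range(steps):
        r = [yi - bi for yi, bi in zip(y, beta)]            # negative gradient
        j = max(range(len(y)), key=lambda k: abs(r[k]))     # steepest coordinate
        beta[j] += eps * math.copysign(1.0, r[j])
    return beta


y = [3.0, 1.0, -2.0]
c, eps = 2.0, 0.01
exact = lasso_constrained_identity(y, c)
approx = eps_boost_identity(y, eps, steps=round(c / eps))
```

Here the exact solution is `[1.5, 0.0, -0.5]` and the boosting path stays monotone, so after `c / eps` steps the boosted coefficients agree with it to within a couple of step sizes.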

The two examples we have seen so far have used squared error loss, and we should ask ourselves 
whether this equivalence stretches beyond this loss. Figure 6 shows a similar result, but this time for 
the binomial log-likelihood loss, $C_l$. We used the “spam” data set, taken from the UCI repository
(Blake and Merz, 1998). We chose only 5 predictors of the 57 to make the plots more interpretable
and the computations more accommodating. We see that there is a perfect equivalence between the
exact constrained solution (i.e., regularized logistic regression) and $\epsilon$-boosting in this case, since the
paths are fully monotone.

To justify why this observed equivalence is not surprising, let us consider the following
“$\ell_1$-locally optimal monotone direction” problem of finding the best monotone $\epsilon$ increment to a given
model $\beta_0$:

$$\min_\beta\; C(\beta) \tag{12}$$
$$\text{s.t. } \|\beta\|_1 - \|\beta_0\|_1 \le \epsilon, \quad |\beta| \ge |\beta_0| \text{ (component-wise)}.$$

Figure 6: Exact coefficient paths (left panel, “Exact constrained solution”) for $\ell_1$-constrained logistic regression and boosting coefficient
paths (right panel, “$\epsilon$-Stagewise”) with binomial log-likelihood loss on five variables from the “spam” data set.
The boosting path was generated using $\epsilon = 0.003$ and 7000 iterations.


Here we use $C(\beta)$ as shorthand for $\sum_i C(y_i, h(x_i)'\beta)$. A first order Taylor expansion gives us

$$C(\beta) = C(\beta_0) + \nabla C(\beta_0)'(\beta - \beta_0) + O(\epsilon^2).$$

And given the $\ell_1$ constraint on the increase in $\|\beta\|_1$, it is easy to see that a first-order optimal solution
(and therefore an optimal solution as $\epsilon \to 0$) will make a “coordinate descent” step, i.e.

$$\beta_j \ne \beta_{0,j} \;\Rightarrow\; |\nabla C(\beta_0)_j| = \max_k |\nabla C(\beta_0)_k|,$$

assuming the signs match, i.e., $\mathrm{sign}(\beta_{0,j}) = -\mathrm{sign}(\nabla C(\beta_0)_j)$.
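
The “easy to see” step can be spelled out as follows. This is a sketch of the standard argument, written for the increment $\delta = \beta - \beta_0$ and assuming no coefficient changes sign, so that the constraint in (12) reduces to $\|\delta\|_1 \le \epsilon$.

```latex
\begin{align*}
\nabla C(\beta_0)'\delta
  = \sum_j \nabla C(\beta_0)_j\,\delta_j
  \;\ge\; -\max_k |\nabla C(\beta_0)_k| \sum_j |\delta_j|
  \;\ge\; -\epsilon \max_k |\nabla C(\beta_0)_k| ,
\end{align*}
with equality when
$\delta = -\epsilon\,\mathrm{sign}\bigl(\nabla C(\beta_0)_{j^*}\bigr)\,e_{j^*}$
for some $j^* \in \arg\max_k |\nabla C(\beta_0)_k|$,
where $e_{j^*}$ is the $j^*$-th unit vector.
```

The linear term is therefore minimized by spending the whole $\ell_1$ budget on a coordinate with the largest absolute gradient, which is exactly steps (c) and (d) of Algorithm 1 with $\alpha_t = \epsilon$.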

So we get that if the optimal solution to (12) without the monotonicity constraint happens to be
monotone, then it is equivalent to a coordinate descent step. And so it is reasonable to expect that if
the optimal $l_1$ regularized path is monotone (as it indeed is in Figures 1 and 6), then an “infinitesimal”
$\epsilon$-boosting algorithm would follow the same path of solutions. Furthermore, even if the optimal
path is not monotone, we can still use the formulation (12) to argue that $\epsilon$-boosting would tend to
follow an approximate $l_1$-regularized path. The main difference between the $\epsilon$-boosting path and
the true optimal path is that it will tend to “delay” becoming non-monotone, as we observe for
variable 7 in Figure 5. To understand this specific phenomenon would require analysis of the true
optimal path, which falls outside the scope of our discussion. Efron et al. (2004) cover the subject
for squared error loss, and their discussion applies to any continuously differentiable convex loss,
using second-order approximations.
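The coordinate-descent reading of (12) can be made concrete with a small sketch. It is an illustration under stated assumptions, not the implementation used for the figures: it takes $C_l$ to be $\log(1 + \exp(-yF))$ (the definition given earlier in the paper is assumed), uses a tiny made-up dictionary matrix `H` whose rows are $h(x_i)$, and the names `logistic_loss_and_grad` and `eps_boost` are hypothetical. Each iteration moves the coordinate with the largest absolute gradient by $\epsilon$ in the loss-decreasing direction.

```python
import math

def logistic_loss_and_grad(H, y, beta):
    """Return sum_i log(1 + exp(-y_i * h(x_i)'beta)) and its gradient in beta."""
    loss = 0.0
    grad = [0.0] * len(beta)
    for h_i, y_i in zip(H, y):
        margin = y_i * sum(b * h for b, h in zip(beta, h_i))
        e = math.exp(-abs(margin))
        # Numerically stable log(1 + exp(-margin)).
        loss += max(-margin, 0.0) + math.log1p(e)
        # weight = 1 / (1 + exp(margin)), evaluated without overflow.
        weight = e / (1.0 + e) if margin >= 0 else 1.0 / (1.0 + e)
        for j, h_ij in enumerate(h_i):
            grad[j] -= weight * y_i * h_ij
    return loss, grad

def eps_boost(H, y, eps, n_steps):
    """Run n_steps of eps-boosting from beta = 0; return beta and the loss path."""
    beta = [0.0] * len(H[0])
    losses = []
    for _ in range(n_steps):
        loss, grad = logistic_loss_and_grad(H, y, beta)
        losses.append(loss)
        # Coordinate with the steepest first-order decrease per unit of l1 norm.
        j = max(range(len(beta)), key=lambda k: abs(grad[k]))
        beta[j] -= eps * math.copysign(1.0, grad[j])
    final_loss, _ = logistic_loss_and_grad(H, y, beta)
    losses.append(final_loss)
    return beta, losses

# Toy data (hypothetical): six points, three dictionary functions.
H = [[1.0, 0.5, -0.2], [1.0, -1.0, 0.3], [1.0, 1.5, 0.8],
     [1.0, -0.3, -1.0], [1.0, 0.2, 0.5], [1.0, -1.2, -0.4]]
y = [1, -1, 1, -1, 1, -1]

beta, losses = eps_boost(H, y, eps=0.01, n_steps=200)
print(beta, losses[0], losses[-1])
```

Because every step changes one coefficient by exactly $\epsilon$, the $l_1$ norm of the result is at most $\epsilon$ times the number of steps, which is the sense in which the iteration count plays the role of the constraint.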

We can employ this understanding of the relationship between boosting and $l_1$ regularization
to construct $l_p$ boosting algorithms by changing the coordinate-selection criterion in the coordinate
descent algorithm. We will get back to this point in Section 7, where we design an “$l_2$ boosting”
algorithm.

The experimental evidence and heuristic discussion we have presented lead us to the following
conjecture which connects slow boosting and $l_1$-regularized optimization:

Conjecture 2 Consider applying the $\epsilon$-boosting algorithm to any convex loss function, generating
a path of solutions $\beta^{(\epsilon)}(t)$. Then if the optimal coefficient paths are monotone $\forall c < c_0$, i.e., if
$\forall j$, $|\hat\beta(c)_j|$ is non-decreasing in the range $c < c_0$, then

$$\lim_{\epsilon \to 0} \beta^{(\epsilon)}(c_0/\epsilon) = \hat\beta(c_0).$$
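The conjecture can be checked by hand in one very simple setting. The sketch below is a toy check under assumptions, not evidence for the general claim: it uses squared error loss and a made-up dictionary with orthonormal columns, for which the exact $l_1$-constrained solution is the soft-thresholded least squares fit (a standard fact about the lasso) and all coefficient paths are monotone. The function names are hypothetical.

```python
import math

def eps_boost_squared_error(H, y, eps, n_steps):
    """eps-boosting for the loss sum_i (y_i - h(x_i)'beta)^2, started at beta = 0."""
    n, J = len(H), len(H[0])
    beta = [0.0] * J
    for _ in range(n_steps):
        resid = [y[i] - sum(H[i][j] * beta[j] for j in range(J)) for i in range(n)]
        grad = [-2.0 * sum(H[i][j] * resid[i] for i in range(n)) for j in range(J)]
        j = max(range(J), key=lambda k: abs(grad[k]))
        beta[j] -= eps * math.copysign(1.0, grad[j])
    return beta

def l1_constrained_orthonormal(H, y, c):
    """Exact l1-constrained least squares when H has orthonormal columns."""
    n, J = len(H), len(H[0])
    z = [sum(H[i][j] * y[i] for i in range(n)) for j in range(J)]
    lo, hi = 0.0, max(abs(v) for v in z)
    for _ in range(200):  # bisection on the threshold so that ||beta||_1 = c
        gamma = 0.5 * (lo + hi)
        if sum(max(abs(v) - gamma, 0.0) for v in z) > c:
            lo = gamma
        else:
            hi = gamma
    gamma = 0.5 * (lo + hi)
    return [math.copysign(max(abs(v) - gamma, 0.0), v) for v in z]

# Orthonormal toy dictionary (scaled Hadamard columns) and a made-up response.
H = [[0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0.5, 0.5, -0.5], [0.5, -0.5, -0.5]]
y = [1.35, 2.65, -0.35, 2.35]

eps, c0 = 0.01, 1.5
boosted = eps_boost_squared_error(H, y, eps, n_steps=round(c0 / eps))
exact = l1_constrained_orthonormal(H, y, c0)
print(boosted, exact)
```

In this example the boosted coefficients after $c_0/\epsilon$ steps differ from the exact constrained solution by at most $\epsilon$ in each coordinate, so the gap closes as $\epsilon \to 0$.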


Efron et al. (2004, Theorem 2) prove a weaker “local” result for the case of squared error loss
only. We generalize their result to any convex loss. However, this result still does not prove the
“global” convergence which the conjecture claims, and the empirical evidence implies. For the sake
of brevity and readability, we defer this proof, together with a concise mathematical definition of the
different types of convergence, to Appendix A.

In the context of “real-life” boosting, where the number of basis functions is usually very large,
and making $\epsilon$ small enough for the theory to apply would require running the algorithm forever,
these results should not be considered directly applicable. Instead, they should be taken as an
intuitive indication that boosting, especially the $\epsilon$ version, is indeed approximating optimal solutions
to the constrained problems it encounters along the way.


5. $l_p$-Constrained Classification Loss Functions


Having established the relation between boosting and $l_1$ regularization, we are going to turn our
attention to the regularized optimization problem. By analogy, our results will apply to boosting
as well. We concentrate on $C_e$ and $C_l$, the two classification losses defined above, and the solution
paths of their $l_p$ constrained versions:

$$\hat\beta^{(p)}(c) = \arg\min_{\|\beta\|_p \le c} \sum_i C(y_i, \beta'h(x_i)), \qquad (13)$$

where $C$ is either $C_e$ or $C_l$. As we discussed below Equation (11), if the training data is separable in
span($\mathcal{H}$), then we have $\|\hat\beta^{(p)}(c)\|_p = c$ for all values of $c$. Consequently

$$\left\| \frac{\hat\beta^{(p)}(c)}{c} \right\|_p = 1.$$

We may ask what are the convergence points of this sequence as $c \to \infty$. The following theorem
shows that these convergence points describe “$l_p$-margin maximizing” separating hyper-planes.





Theorem 3 Assume the data is separable, i.e., $\exists \beta$ s.t. $\forall i$, $y_i \beta'h(x_i) > 0$.
Then for both $C_e$ and $C_l$, every convergence point of $\hat\beta^{(p)}(c)/c$ corresponds to an $l_p$-margin-maximizing
separating hyper-plane.

If the $l_p$-margin-maximizing separating hyper-plane is unique, then it is the unique convergence
point, i.e.

$$\hat\beta^{(p)} = \lim_{c \to \infty} \frac{\hat\beta^{(p)}(c)}{c} = \arg\max_{\|\beta\|_p = 1} \min_i y_i \beta'h(x_i). \qquad (14)$$


Proof This proof applies to both $C_e$ and $C_l$, given the property in (5). Consider two separating
candidates $\beta_1$ and $\beta_2$ such that $\|\beta_1\|_p = \|\beta_2\|_p = 1$. Assume that $\beta_1$ separates better, i.e.

$$m_1 := \min_i y_i \beta_1'h(x_i) > m_2 := \min_i y_i \beta_2'h(x_i) > 0.$$

Then we have the following simple lemma:

Lemma 4 There exists some $D = D(m_1, m_2)$ such that $\forall d > D$, $d\beta_1$ incurs smaller loss than $d\beta_2$,
in other words:

$$\sum_i C(y_i, d\beta_1'h(x_i)) < \sum_i C(y_i, d\beta_2'h(x_i)).$$
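A quick numerical sanity check of the lemma, as a sketch on made-up two-dimensional data rather than anything from the paper. It assumes the margin forms $C_e = \exp(-m)$ and $C_l = \log(1 + \exp(-m))$ with $m = yF$ for the two losses (the definitions given earlier in the paper are assumed), takes $p = 2$, and uses the threshold $D = (\log n + \log 2)/(m_1 - m_2)$, which is what the requirement $n\exp(-d\,m_1) < 0.5\exp(-d\,m_2)$ gives when solved for $d$. The names `total_loss` and `min_margin` are illustrative.

```python
import math

def min_margin(beta, X, y):
    """Smallest value of y_i * beta'x_i over the data."""
    return min(y_i * sum(b * x for b, x in zip(beta, x_i)) for x_i, y_i in zip(X, y))

def total_loss(loss, beta, X, y, d):
    """Sum of loss(margin_i) at the scaled coefficient vector d * beta."""
    return sum(loss(d * y_i * sum(b * x for b, x in zip(beta, x_i)))
               for x_i, y_i in zip(X, y))

def exp_loss(m):
    return math.exp(-m)               # C_e as a function of the margin

def logistic_loss(m):
    return math.log1p(math.exp(-m))   # C_l as a function of the margin (m >= 0 here)

# Toy separable data (hypothetical).
X = [(1, 1), (2, 1), (1, 3), (-1, -1), (-2, -1), (-1, -3)]
y = [1, 1, 1, -1, -1, -1]
beta1 = (1 / math.sqrt(2), 1 / math.sqrt(2))  # unit l2 norm, larger minimal margin
beta2 = (1.0, 0.0)                            # unit l2 norm, smaller minimal margin

m1, m2 = min_margin(beta1, X, y), min_margin(beta2, X, y)
D = (math.log(len(X)) + math.log(2)) / (m1 - m2)

ds = [D + 0.25 * k for k in range(1, 41)]
for name, loss in (("exponential", exp_loss), ("logistic", logistic_loss)):
    ok = all(total_loss(loss, beta1, X, y, d) < total_loss(loss, beta2, X, y, d) for d in ds)
    print(name, "loss: better separator wins for all tested d > D:", ok)
```

The bound $D$ is sufficient, not tight: on this data the better separator already has the smaller loss for much smaller $d$.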


Given this lemma, we can now prove that any convergence point of $\hat\beta^{(p)}(c)/c$ must be an $l_p$-margin
maximizing separator. Assume $\beta^*$ is a convergence point of $\hat\beta^{(p)}(c)/c$. Denote its minimal margin on
the data by $m^*$. If the data is separable, clearly $m^* > 0$ (since otherwise the loss of $d\beta^*$ does not
even converge to 0 as $d \to \infty$).

Now, assume some $\tilde\beta$ with $\|\tilde\beta\|_p = 1$ has bigger minimal margin $\tilde m > m^*$. By continuity of the
minimal margin in $\beta$, there exists some open neighborhood of $\beta^*$

$$N_{\beta^*} = \{\beta : \|\beta - \beta^*\|_2 < \delta\}$$

and an $\epsilon > 0$, such that

$$\min_i y_i \beta'h(x_i) < \tilde m - \epsilon, \quad \forall \beta \in N_{\beta^*}.$$

Now by the lemma we get that there exists some $D = D(\tilde m, \tilde m - \epsilon)$ such that $d\tilde\beta$ incurs smaller
loss than $d\beta$ for any $d > D$, $\beta \in N_{\beta^*}$. Therefore $\beta^*$ cannot be a convergence point of $\hat\beta^{(p)}(c)/c$.

We conclude that any convergence point of the sequence $\hat\beta^{(p)}(c)/c$ must be an $l_p$-margin maximizing
separator. If the margin maximizing separator is unique then it is the only possible convergence
point, and therefore

$$\hat\beta^{(p)} = \lim_{c \to \infty} \frac{\hat\beta^{(p)}(c)}{c} = \arg\max_{\|\beta\|_p = 1} \min_i y_i \beta'h(x_i).$$


Proof of Lemma Using (5) and the definition of $C_e$, we get for both loss functions:

$$\sum_i C(y_i, d\beta_1'h(x_i)) \le n \exp(-d \cdot m_1).$$

Now, since $\beta_1$ separates better, we can find our desired

$$D = D(m_1, m_2) = \frac{\log n + \log 2}{m_1 - m_2}$$

such that

$$\forall d > D, \quad n \exp(-d \cdot m_1) < 0.5 \exp(-d \cdot m_2).$$

And using (5) and the definition of $C_e$ again we can write

$$0.5 \exp(-d \cdot m_2) \le \sum_i C(y_i, d\beta_2'h(x_i)).$$

Combining these three inequalities we get our desired result:

$$\forall d > D, \quad \sum_i C(y_i, d\beta_1'h(x_i)) < \sum_i C(y_i, d\beta_2'h(x_i)).$$

We thus conclude that if the $l_p$-margin maximizing separating hyper-plane is unique, the
normalized constrained solution converges to it. In the case that the margin maximizing separating
hyper-plane is not unique, we can in fact prove a stronger result, which indicates that the limit of
the regularized solutions would then be determined by the second smallest margin, then by the third
and so on. This result is mainly of technical interest and we prove it in Appendix B, Section 2.
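The convergence described by Theorem 3 can be watched on a toy problem. The following is a sketch under several assumptions, not a general solver: it uses the exponential loss $C_e = \exp(-yF)$ (definition assumed from earlier in the paper), $p = 2$, and three made-up points that all carry label $+1$ and have non-negative coordinates. For such data the constrained minimizer lies on the first-quadrant arc of the circle $\|\beta\|_2 = c$, so it can be found by a one-dimensional search over the angle; the log of the total loss is convex in the angle on that arc, which justifies the ternary search. The function names are hypothetical.

```python
import math

X = [(2.0, 0.0), (0.0, 1.0), (3.0, 3.0)]  # toy points, all with label y = +1

def log_total_exp_loss(theta, c):
    """log sum_i exp(-c * beta'x_i) for beta = (cos theta, sin theta), computed stably."""
    exponents = [-c * (math.cos(theta) * x1 + math.sin(theta) * x2) for x1, x2 in X]
    top = max(exponents)
    return top + math.log(sum(math.exp(e - top) for e in exponents))

def constrained_angle(c, iters=200):
    """Angle of the exponential-loss minimizer on the arc ||beta||_2 = c, beta >= 0."""
    lo, hi = 0.0, math.pi / 2
    for _ in range(iters):  # ternary search on a convex function of theta
        a, b = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if log_total_exp_loss(a, c) < log_total_exp_loss(b, c):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)

# The l2-margin maximizer solves max min(2 cos t, sin t, 3 cos t + 3 sin t), i.e. tan t = 2.
max_margin_angle = math.atan2(2.0, 1.0)
errors = [abs(constrained_angle(c) - max_margin_angle) for c in (10.0, 100.0, 1000.0)]
print(errors)
```

The angular distance between the normalized constrained solution and the margin-maximizing direction shrinks roughly in proportion to $1/c$ in this example.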


5.1 Implications of Theorem 3 


We now briefly discuss the implications of this theorem for boosting and logistic regression. 


5.1.1 BOOSTING IMPLICATIONS 


Combined with our results from Section 4, Theorem 3 indicates that the normalized boosting path
$\beta^{(t)} / \sum_{u \le t} \alpha_u$, with either $C_e$ or $C_l$ used as loss, “approximately” converges to a separating hyper-plane
$\hat\beta$, which attains

$$\max_{\|\beta\|_1 = 1} \min_i y_i \beta'h(x_i) = \max_{\|\beta\|_1 = 1} \|\beta\|_2 \min_i y_i d_i, \qquad (15)$$

where $d_i$ is the (signed) Euclidean distance from the training point $i$ to the separating hyper-plane. In
other words, it maximizes Euclidean distance scaled by an $l_2$ norm. As we have mentioned already,
this implies that the asymptotic boosting solution will tend to be sparse in representation, due to the
fact that for fixed $l_1$ norm, the $l_2$ norm of vectors that have many 0 entries will generally be larger.
In fact, under rather mild conditions, the asymptotic solution $\hat\beta = \lim_{c \to \infty} \hat\beta^{(1)}(c)/c$ will have at
most $n$ (the number of observations) non-zero coefficients, if we use either $C_l$ or $C_e$ as the loss. See
Appendix B, Section 1 for proof.
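The two ingredients of this argument are easy to verify numerically. The sketch below, on made-up numbers with hypothetical helper names, checks that the margin $y_i \beta'h(x_i)$ equals $\|\beta\|_2$ times the signed distance $y_i d_i$, and that of two vectors with the same $l_1$ norm the sparser one has the larger $l_2$ norm.

```python
import math

def l1_norm(v):
    return sum(abs(a) for a in v)

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

def min_margin(beta, H, y):
    """min_i y_i * beta'h(x_i)."""
    return min(y_i * sum(b * h for b, h in zip(beta, h_i)) for h_i, y_i in zip(H, y))

def min_signed_distance(beta, H, y):
    """min_i y_i * d_i, with d_i the signed Euclidean distance of h(x_i) to {h : beta'h = 0}."""
    return min_margin(beta, H, y) / l2_norm(beta)

# Toy dictionary values h(x_i) and labels (hypothetical).
H = [(1.0, 2.0, 0.5, 1.0), (2.0, 1.0, 1.5, 0.5), (-1.0, -1.0, -2.0, -0.5)]
y = [1, 1, -1]

sparse = [1.0, 0.0, 0.0, 0.0]     # l1 norm 1, l2 norm 1
dense = [0.25, 0.25, 0.25, 0.25]  # l1 norm 1, l2 norm 0.5

for beta in (sparse, dense):
    print(l1_norm(beta), l2_norm(beta), min_margin(beta, H, y),
          l2_norm(beta) * min_signed_distance(beta, H, y))
```

For a fixed Euclidean distance to the hyper-plane, the sparse vector therefore gets twice the credit of the dense one in the right-hand side of (15).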


5.1.2 LOGISTIC REGRESSION IMPLICATIONS 


Recall that the logistic regression (maximum likelihood) solution is undefined if the data is
separable in the Euclidean space spanned by the predictors. Theorem 3 allows us to define a logistic
regression solution for separable data, as follows:

1. Set a high constraint value $c_{\max}$.

2. Find $\hat\beta^{(p)}(c_{\max})$, the solution to the logistic regression problem subject to the constraint
$\|\beta\|_p \le c_{\max}$. The problem is convex for any $p \ge 1$ and differentiable for any $p > 1$, so interior point
methods can be used to solve this problem.

3. Now you have (approximately) the $l_p$-margin maximizing solution for this data, described by

$$\frac{\hat\beta^{(p)}(c_{\max})}{c_{\max}}.$$

This is a solution to the original problem in the sense that it is, approximately, the convergence
point of the normalized $l_p$-constrained solutions, as the constraint is relaxed.

Of course, with our result from Theorem 3 it would probably make more sense to simply find the
optimal separating hyper-plane directly: this is a linear programming problem for $l_1$ separation and
a quadratic programming problem for $l_2$ separation. We can then consider this optimal separator as
a logistic regression solution for the separable data.


6. Examples 


We now apply boosting to several data sets and interpret the results in light of our regularization and 
margin-maximization view. 


6.1 Spam Data Set 


We now know that if the data are separable and we let boosting run forever, we will approach the same
“optimal” separator for both $C_e$ and $C_l$. However, if we stop early, or if the data is not separable,
the behavior of the two loss functions may differ significantly, since $C_e$ weighs negative margins
exponentially, while $C_l$ is approximately linear in the margin for large negative margins (see
Friedman et al., 2000). Consequently, we can expect $C_e$ to concentrate more on the “hard” training data,
in particular in the non-separable case. Figure 7 illustrates the behavior of $\epsilon$-boosting with both
loss functions, as well as that of AdaBoost, on the spam data set (57 predictors, binary response).
We used 10 node trees and $\epsilon = 0.1$. The left plot shows the minimal margin as a function of the
$l_1$ norm of the coefficient vector $\|\beta\|_1$. Binomial loss creates a bigger minimal margin initially,
but the minimal margins for both loss functions are converging asymptotically. AdaBoost initially
lags behind but catches up nicely and reaches the same minimal margin asymptotically. The right
plot shows the test error as the iterations proceed, illustrating that both $\epsilon$-methods indeed seem to
over-fit eventually, even as their “separation” (minimal margin) is still improving. AdaBoost did not
significantly over-fit in the 1000 iterations it was allowed to run, but it obviously would have if it
were allowed to run on.

[Figure 7 panels: “Minimal margins” (left) and “Test error” (right), each plotted against $\|\beta\|_1$, with curves for exponential, logistic and AdaBoost]
Figure 7: Behavior of boosting with the two loss functions on spam data set

We should emphasize that the comparison between AdaBoost and $\epsilon$-boosting presented
considers as a basis for comparison the $l_1$ norm, not the number of iterations. In terms of computational
complexity, as represented by the number of iterations, AdaBoost reaches both a large minimal
margin and good prediction performance much more quickly than the “slow boosting” approaches, as
AdaBoost tends to take larger steps.


6.2 Simulated Data 


To make a more educated comparison and more compelling visualization, we have constructed an
example of separation of 2-dimensional data using an 8th-degree polynomial dictionary (45
functions). The data consists of 50 observations of each class, drawn from a mixture of Gaussians, and
presented in Figure 8. Also presented, in the solid line, is the optimal $l_1$ separator for this data in
this dictionary (easily calculated as a linear programming problem; note the difference from the $l_2$
optimal decision boundary, presented in Section 7.1, Figure 11). The optimal $l_1$ separator has only
12 non-zero coefficients out of 45.

Figure 8: Artificial data set with $l_1$-margin maximizing separator (solid), and boosting models
after 10° iterations (dashed) and 10° iterations (dotted) using $\epsilon = 0.001$. We observe the
convergence of the boosting separator to the optimal separator


We ran an $\epsilon$-boosting algorithm on this data set, using the logistic log-likelihood loss $C_l$, with
$\epsilon = 0.001$, and Figure 8 shows two of the models generated after 10° and 3 · 10° iterations. We see
that the models seem to converge to the optimal separator. A different view of this convergence is
given in Figure 9, where we see two measures of convergence: the minimal margin (left, maximum
value obtainable is the horizontal line) and the $l_1$-norm distance between the normalized models
(right), given by

$$\left\| \hat\beta - \frac{\beta^{(t)}}{\|\beta^{(t)}\|_1} \right\|_1,$$

where $\hat\beta$ is the optimal separator with $l_1$ norm 1 and $\beta^{(t)}$ is the boosting model after $t$ iterations.

We can conclude that on this simple artificial example we get nice convergence of the
logistic-boosting model path to the $l_1$-margin maximizing separating hyper-plane.

We can also use this example to illustrate the similarity between the boosted path and the path
of $l_1$ optimal solutions, as we have discussed in Section 4.

Figure 9: Two measures of convergence of boosting model path to optimal $l_1$ separator: minimal
margin (left) and $l_1$ distance between the normalized boosting coefficient vector and the
optimal model (right)

Figure 10: Comparison of decision boundary of boosting models (broken) and of optimal
constrained solutions with same norm (full)


Figure 10 shows the class decision boundaries for 4 models generated along the boosting path,
compared to the optimal solutions to the constrained “logistic regression” problem with the same
bound on the $l_1$ norm of the coefficient vector. We observe the clear similarities in the way the
solutions evolve and converge to the optimal $l_1$ separator. The fact that they differ (in some cases
significantly) is not surprising if we recall the monotonicity condition presented in Section 4 for
exact correspondence between the two model paths. In this case if we look at the coefficient paths
(not shown), we observe that the monotonicity condition is consistently violated in the low norm
ranges, and hence we can expect the paths to be similar in spirit but not identical.


7. Discussion 


We can now summarize what we have learned about boosting from the previous sections:

• Boosting approximately follows the path of $l_1$-regularized models for its loss criterion

• If the loss criterion is the exponential loss of AdaBoost or the binomial log-likelihood loss
of logistic regression, then the $l_1$ regularized model converges to an $l_1$-margin maximizing
separating hyper-plane, if the data are separable in the span of the weak learners


We may ask which of these two points is the key to the success of boosting approaches. One
empirical clue to answering this question can be found in Breiman (1999), who programmed an
algorithm to directly maximize the margins. His results were that his algorithm consistently got
significantly higher minimal margins than AdaBoost on many data sets (and, in fact, a “higher”
margin distribution beyond the minimal margin), but had slightly worse prediction performance. His
conclusion was that margin maximization is not the key to AdaBoost’s success. From a statistical
perspective we can embrace this conclusion, as reflecting the importance of regularization in
high-dimensional predictor space. By our results from the previous sections, “margin maximization”
can be viewed as the limit of parametric regularized models, as the regularization vanishes. Thus
we would generally expect the margin maximizing solutions to perform worse than regularized
models.⁴ In the case of boosting, regularization would correspond to “early stopping” of the boosting
algorithm.


7.1 Boosting and SVMs as Regularized Optimization in High-dimensional Predictor Spaces 


Our exposition has led us to view boosting as an approximate way to solve the regularized
optimization problem

$$\min_{\beta} \sum_i C(y_i, \beta'h(x_i)) + \lambda \|\beta\|_1 \qquad (16)$$

which converges as $\lambda \to 0$ to $\hat\beta^{(1)}$, if our loss is $C_e$ or $C_l$. In general, the loss $C$ can be any convex
differentiable loss and should be defined to match the problem domain.

Support vector machines can be described as solving the regularized optimization problem (see
Friedman et al., 2000, Chapter 12)

$$\min_{\beta} \sum_i (1 - y_i \beta'h(x_i))_+ + \lambda \|\beta\|_2^2 \qquad (17)$$

which “converges” as $\lambda \to 0$ to the non-regularized support vector machine solution, i.e., the optimal
Euclidean separator, which we denoted by $\hat\beta^{(2)}$.
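To make the parallel between (16) and (17) explicit, the two objectives can be written as functions of the margins $y_i \beta'h(x_i)$. This is only a sketch of the formulas on a made-up two-point example; the function names are hypothetical, and the exponential loss $\exp(-yF)$ is used as one admissible choice of $C$ (its definition is assumed from earlier in the paper).

```python
import math

def margins(beta, H, y):
    """Margins y_i * beta'h(x_i) for every training point."""
    return [y_i * sum(b * h for b, h in zip(beta, h_i)) for h_i, y_i in zip(H, y)]

def l1_penalized_objective(beta, H, y, lam, loss):
    """Problem (16): sum_i C(y_i, beta'h(x_i)) + lam * ||beta||_1, with C given via the margin."""
    return sum(loss(m) for m in margins(beta, H, y)) + lam * sum(abs(b) for b in beta)

def svm_objective(beta, H, y, lam):
    """Problem (17): sum_i (1 - y_i beta'h(x_i))_+ + lam * ||beta||_2^2."""
    hinge = sum(max(1.0 - m, 0.0) for m in margins(beta, H, y))
    return hinge + lam * sum(b * b for b in beta)

def exp_loss(m):
    return math.exp(-m)

# Two toy points (hypothetical); the second one is misclassified by beta.
H = [(1.0, 0.0), (0.0, 1.0)]
y = [1, 1]
beta = (1.0, -1.0)

print(l1_penalized_objective(beta, H, y, lam=0.5, loss=exp_loss))
print(svm_objective(beta, H, y, lam=0.5))
```

The two functions differ only in the loss applied to the margins and in the norm used for the penalty, which is the comparison the rest of this section develops.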

An interesting connection exists between these two approaches, in that they allow us to solve
the regularized optimization problem in high dimensional predictor space:

• We are able to solve the $l_1$-regularized problem approximately in very high dimension via
boosting by applying the “approximate coordinate descent” trick of building a decision tree
(or otherwise greedily selecting a weak learner) based on re-weighted versions of the data.

• Support vector machines facilitate a different trick for solving the regularized optimization
problem in high dimensional predictor space: the “kernel trick”. If our dictionary $\mathcal{H}$ spans
a Reproducing Kernel Hilbert Space, then RKHS theory tells us we can find the regularized
solutions by solving an $n$-dimensional problem, in the space spanned by the kernel
representers $\{K(x_i, x)\}$. This fact is by no means limited to the hinge loss of (17), and applies to any
convex loss. We concentrate our discussion on SVM (and hence hinge loss) only since it is
by far the most common and well-known application of this result.

4. It can be argued that margin-maximizing models are still “regularized” in some sense, as they minimize a norm
criterion among all separating models. This is arguably the property which still allows them to generalize reasonably
well in many cases.


So we can view both boosting and SVM as methods that allow us to fit regularized models in
high dimensional predictor space using a computational “shortcut”. The complexity of the model
built is controlled by regularization. These methods are distinctly different from traditional statistical
approaches for building models in high dimension, which start by reducing the dimensionality of
the problem so that standard tools (e.g., Newton’s method) can be applied to it, and also to make
over-fitting less of a concern. While the merits of regularization without dimensionality reduction,
like Ridge regression or the Lasso, are well documented in statistics, computational issues make it
impractical for the size of problems typically solved via boosting or SVM, without computational
tricks.

We believe that this difference may be a significant reason for the enduring success of boosting 
and SVM in data modeling, i.e.: 


Working in high dimension and regularizing is statistically preferable to a two-step 
procedure of first reducing the dimension, then fitting a model in the reduced space. 


It is also interesting to consider the differences between the two approaches, in the loss (flexible
vs. hinge loss), the penalty ($l_1$ vs. $l_2$), and the type of dictionary used (usually trees vs. RKHS).
These differences indicate that the two approaches will be useful for different situations. For
example, if the true model has a sparse representation in the chosen dictionary, then $l_1$ regularization
may be warranted; if the form of the true model facilitates description of the class probabilities via
a logistic-linear model, then the logistic loss $C_l$ is the best loss to use, and so on.

The computational tricks for both SVM and boosting limit the kind of regularization that can 
be used for fitting in high dimensional space. However, the problems can still be formulated and 
solved for different regularization approaches, as long as the dimensionality is low enough: 


• Support vector machines can be fitted with an $l_1$ penalty, by solving the 1-norm version of the
SVM problem, equivalent to replacing the $l_2$ penalty in (17) with an $l_1$ penalty. In fact, the
1-norm SVM is used quite widely, because it is more easily solved in the “linear”, non-RKHS,
situation (as a linear program, compared to the standard SVM which is a quadratic program)
and tends to give sparser solutions in the primal domain.

• Similarly, we describe below an approach for developing a “boosting” algorithm for fitting
approximate $l_2$ regularized models.


Both of these methods are interesting and potentially useful. However, they lack what is arguably
the most attractive property of the “standard” boosting and SVM algorithms: a computational trick
to allow fitting in high dimensions.

7.1.1 AN $l_2$ BOOSTING ALGORITHM


We can use our understanding of the relation of boosting to regularization and Theorem 3 to
formulate $l_p$-boosting algorithms, which will approximately follow the path of $l_p$-regularized solutions
and converge to the corresponding $l_p$-margin maximizing separating hyper-planes. Of particular
interest is the $l_2$ case, since Theorem 3 implies that $l_2$-constrained fitting using $C_l$ or $C_e$ will build a
regularized path to the optimal separating hyper-plane in the Euclidean (or SVM) sense.

To construct an $l_2$ boosting algorithm, consider the “equivalent” optimization problem (12), and
change the step-size constraint to an $l_2$ constraint:

$$\|\beta\|_2 - \|\beta_0\|_2 \le \epsilon.$$

It is easy to see that the first order solution to this problem entails selecting for modification the
coordinate which maximizes

$$\frac{|\nabla C(\beta_0)_k|}{|\beta_{0,k}|}$$

and that subject to monotonicity, this will lead to a correspondence to the locally $l_2$-optimal
direction.

Following this intuition, we can construct an $l_2$ boosting algorithm by changing only step 2(c)
of our generic boosting algorithm of Section 2 to

2(c)* Identify $j_t$ which maximizes $\dfrac{|\sum_i w_i h_{j_t}(x_i)|}{|\beta_{j_t}|}$

Note that the need to consider the current coefficient (in the denominator) makes the $l_2$ algorithm
appropriate for toy examples only. In situations where the dictionary of weak learners is prohibitively
large, we will need to figure out a trick like the one we presented in Section 2.1, to allow us to make
an approximate search for the optimizer of step 2(c)*.

Another problem in applying this algorithm to large problems is that we never choose the same
dictionary function twice, until all have non-0 coefficients. This is due to the use of the $l_2$ penalty,
where the current coefficient value affects the rate at which the penalty term is increasing. In
particular, if $\beta_j = 0$ then increasing it causes the penalty term $\|\beta\|_2$ to increase at rate 0, to first order
(which is all the algorithm is considering).
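The selection rule of step 2(c)* and the “never twice until all are non-zero” behavior can be seen in a toy sketch. The code below is an illustration under assumptions rather than the algorithm used for Figure 11: it takes the loss to be $C_l = \log(1 + \exp(-yF))$ (definition assumed from earlier in the paper), works on a made-up dictionary small enough to scan exhaustively, and breaks ties among zero coefficients (whose ratio is infinite) by the size of the gradient, which is a choice made here for definiteness. The names `logistic_grad` and `l2_boost` are hypothetical.

```python
import math

def logistic_grad(H, y, beta):
    """Gradient of sum_i log(1 + exp(-y_i * h(x_i)'beta)) with respect to beta."""
    grad = [0.0] * len(beta)
    for h_i, y_i in zip(H, y):
        margin = y_i * sum(b * h for b, h in zip(beta, h_i))
        e = math.exp(-abs(margin))
        weight = e / (1.0 + e) if margin >= 0 else 1.0 / (1.0 + e)  # 1 / (1 + exp(margin))
        for j, h_ij in enumerate(h_i):
            grad[j] -= weight * y_i * h_ij
    return grad

def l2_boost(H, y, eps, n_steps):
    """Toy l2 boosting: step the coordinate maximizing |gradient_k| / |beta_k| by eps."""
    beta = [0.0] * len(H[0])
    chosen = []
    for _ in range(n_steps):
        grad = logistic_grad(H, y, beta)

        def score(k):
            if grad[k] == 0.0:
                return (0, 0.0)
            if beta[k] == 0.0:
                # Infinite ratio: rank above every finite ratio, ties broken by |gradient|.
                return (1, abs(grad[k]))
            return (0, abs(grad[k]) / abs(beta[k]))

        j = max(range(len(beta)), key=score)
        beta[j] -= eps * math.copysign(1.0, grad[j])
        chosen.append(j)
    return beta, chosen

# Toy data (hypothetical): six points, three dictionary functions.
H = [[0.5, -0.2, 1.0], [-1.0, 0.3, 0.4], [1.5, 0.8, -0.3],
     [-0.3, -1.0, 0.2], [0.2, 0.5, 0.9], [-1.2, -0.4, -0.6]]
y = [1, -1, 1, -1, 1, -1]

beta, chosen = l2_boost(H, y, eps=0.01, n_steps=10)
print(chosen, beta)
```

The first three selections visit three different dictionary functions, so every coefficient is non-zero before any function is chosen a second time. With a large dictionary this initial sweep alone would be prohibitive, which is the practical limitation noted above.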

The convergence of our $l_2$ boosting algorithm on the artificial data set of Section 6.2 is illustrated
in Figure 11. We observe that the $l_2$ boosting models do indeed approach the optimal $l_2$ separator.
It is interesting to note the significant difference between the optimal $l_2$ separator as presented in
Figure 11 and the optimal $l_1$ separator presented in Section 6.2 (Figure 8).


8. Summary and Future Work 


In this paper we have introduced a new view of boosting in general, and two-class boosting in
particular, comprised of two main points:

• We have generalized results from Efron et al. (2004) and Hastie et al. (2001), to describe
boosting as approximate $l_1$-regularized optimization.

• We have shown that the exact $l_1$-regularized solutions converge to an $l_1$-margin maximizing
separating hyper-plane.

Figure 11: Artificial data set with $l_2$-margin maximizing separator (solid), and $l_2$-boosting models
after 5 × 10° iterations (dashed) and $10^8$ iterations (dotted) using $\epsilon = 0.0001$. We observe
the convergence of the boosting separator to the optimal separator


We hope our results will help in better understanding how and why boosting works. It is an
interesting and challenging task to separate the effects of the different components of a boosting algorithm:

• Loss criterion
• Dictionary and greedy learning method
• Line search / slow learning

and relate them to its success in different scenarios. The implicit $l_1$ regularization in boosting
may also contribute to its success, as it has been shown that in some situations $l_1$ regularization is
inherently superior to others (see Donoho et al., 1995).

An important issue when analyzing boosting is over-fitting in the noisy data case. To deal with
over-fitting, Rätsch et al. (2001b) propose several regularization methods and generalizations of the
original AdaBoost algorithm to achieve a soft margin by introducing slack variables. Our results
indicate that the models along the boosting path can be regarded as $l_1$ regularized versions of the
optimal separator, hence regularization can be done more directly and naturally by stopping the
boosting iterations early. It is essentially a choice of the $l_1$ constraint parameter $c$.

Many other questions arise from our view of boosting. Among the issues to be considered:

• Is there a similar “separator” view of multi-class boosting? We have some tentative results to
indicate that this might be the case if the boosting problem is formulated properly.

• Can the constrained optimization view of boosting help in producing generalization error
bounds for boosting that would be tighter than the existing ones?

Appendix A. Local Equivalence of Infinitesimal $\epsilon$-Boosting and $l_1$-Constrained
Optimization


As before, we assume we have a set of training data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, a smooth cost
function $C(y, F)$, and a set of basis functions $(h_1(x), h_2(x), \ldots, h_J(x))$.
We denote by $\hat\beta(s)$ the optimal solution of the $l_1$-constrained optimization problem:

$$\min_{\beta} \sum_{i=1}^{n} C(y_i, h(x_i)'\beta) \qquad (18)$$
$$\text{subject to} \quad \|\beta\|_1 \le s. \qquad (19)$$


Suppose we initialize the $\epsilon$-boosting version of Algorithm 1, as described in Section 2, at $\hat\beta(s)$ and
run the algorithm for $T$ steps. Let $\beta(T)$ denote the coefficients after $T$ steps.
The “global convergence” Conjecture 2 in Section 4 implies that $\forall \Delta s > 0$:

$$\beta(\Delta s/\epsilon) \to \hat\beta(s + \Delta s) \quad \text{as } \epsilon \to 0$$

under some mild assumptions. Instead of proving this “global” result, we show here a “local” result
by looking at the derivative of $\hat\beta(s)$. Our proof builds on the proof by Efron et al. (2004, Theorem 2)
of a similar result for the case that the cost is squared error loss $C(y, F) = (y - F)^2$. Theorem 1 below
shows that if we start the $\epsilon$-boosting algorithm at a solution $\hat\beta(s)$ of the $l_1$-constrained optimization
problem (18)-(19), the “direction of change” of the $\epsilon$-boosting solution will agree with that of the
$l_1$-constrained optimization problem.


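Theorem 1 Assume the optimal coefficient paths $\hat\beta_j(s)$ $\forall j$ are monotone in $s$ and the coefficient
paths $\beta_j(T)$ $\forall j$ are also monotone as $\epsilon$-boosting proceeds, then

$$\frac{\beta(T) - \hat\beta(s)}{T \cdot \epsilon} \to \nabla\hat\beta(s) \quad \text{as } \epsilon \to 0,\ T \to \infty,\ T \cdot \epsilon \to 0.$$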


Proof First we introduce some notations. Let

$$h_j = (h_j(x_1), \ldots, h_j(x_n))'$$

be the $j$th basis function evaluated at the $n$ training data.
Let

$$F = (F(x_1), \ldots, F(x_n))'$$

be the vector of current fit.
Let

$$r = \left( -\frac{\partial C(y_1, F_1)}{\partial F_1}, \ldots, -\frac{\partial C(y_n, F_n)}{\partial F_n} \right)'$$

be the current “generalized residual” vector as defined in Friedman (2001).
Let

$$c_j = h_j' r, \quad j = 1, \ldots, J$$

be the current “correlation” between $h_j$ and $r$.
Let

$$\mathcal{A} = \{j : |c_j| = \max_j |c_j|\}$$

be the set of indices for the maximum absolute correlation.
For clarity, we re-write this $\epsilon$-boosting algorithm, starting from $\hat\beta(s)$, as a special case of
Algorithm 1, as follows:

(1) Initialize $\beta(0) = \hat\beta(s)$, $F_0 = F$, $r_0 = r$.
(2) For $t = 1 : T$
    (a) Find $j_t = \arg\max_j |h_j' r_{t-1}|$.
    (b) Update
        $\beta_{t,j_t} \leftarrow \beta_{t-1,j_t} + \epsilon \cdot \mathrm{sign}(c_{j_t})$
    (c) Update $F_t$ and $r_t$.


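Notice in the above algorithm, we start from $\hat\beta(s)$, rather than 0. As proposed in Efron et al. (2004),
we consider an idealized $\epsilon$-boosting case: $\epsilon \to 0$. As $\epsilon \to 0$, $T \to \infty$ and $T \cdot \epsilon \to 0$, under the
monotone paths condition, Section 3.2 and Section 6 of Efron et al. (2004) showed

$$\frac{F_T - F_0}{T \cdot \epsilon} \to u, \qquad (20)$$
$$\frac{r_T - r_0}{T \cdot \epsilon} \to v, \qquad (21)$$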

where $u$ and $v$ satisfy two constraints:

(Constraint 1) $u$ is in the convex cone generated by $\{\mathrm{sign}(c_j) h_j : j \in \mathcal{A}\}$, i.e.:

$$u = \sum_{j \in \mathcal{A}} P_j\, \mathrm{sign}(c_j) h_j, \quad P_j \ge 0.$$

(Constraint 2) $v$ has equal “correlation” with $\mathrm{sign}(c_j) h_j$, $j \in \mathcal{A}$:

$$\mathrm{sign}(c_j) h_j' v = \lambda_{\mathcal{A}} \quad \text{for } j \in \mathcal{A}.$$

The first constraint is true because the basis functions in $\mathcal{A}^C$ will not be able to catch up in terms
of $|c_j|$ for sufficiently small $T \cdot \epsilon$; the $P_j$’s are non-negative because the coefficient paths $\beta_j(T)$ are
monotone. The second constraint can be seen by taking a Taylor expansion of $C(y, F)$ around $F_0$
to the quadratic term, letting $T \cdot \epsilon$ go to zero and applying the result for the squared error loss from
Efron et al. (2004). Once the two constraints are established, we notice that

$$v_i = -\left.\frac{\partial^2 C(y_i, F)}{\partial F^2}\right|_{F_0(x_i)} u_i.$$

Hence we can plug the constraint 1 into the constraint 2 and get the following set of equations:

$$H_{\mathcal{A}}' W H_{\mathcal{A}} P = -\lambda_{\mathcal{A}} \mathbf{1},$$

where

$$H_{\mathcal{A}} = (\cdots\ \mathrm{sign}(c_j) h_j\ \cdots), \quad j \in \mathcal{A},$$
$$W = \mathrm{diag}\left( \left.\frac{\partial^2 C(y_i, F)}{\partial F^2}\right|_{F_0(x_i)} \right),$$
$$P = (\cdots\ P_j\ \cdots)', \quad j \in \mathcal{A}.$$

If $H_{\mathcal{A}}$ is of rank $|\mathcal{A}|$ (we will get back to this issue in details in Appendix B), then $P$, or equivalently
$u$ and $v$, are uniquely determined up to a scale number.

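Now we consider the $l_1$-constrained optimization problem (18)-(19). Let $\hat F(s)$ be the fitted
vector and $\hat r(s)$ be the corresponding residual vector. Since $\hat F(s)$ and $\hat r(s)$ are smooth, define

$$u^* = \lim_{\Delta s \to 0} \frac{\hat F(s + \Delta s) - \hat F(s)}{\Delta s}, \qquad (22)$$
$$v^* = \lim_{\Delta s \to 0} \frac{\hat r(s + \Delta s) - \hat r(s)}{\Delta s}. \qquad (23)$$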

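Lemma 2 Under the monotone coefficient paths assumption, $u^*$ and $v^*$ also satisfy constraints 1-2.

Proof Write the coefficient $\beta_j$ as $\beta_j^+ - \beta_j^-$, where

$$\begin{cases} \beta_j^+ = \beta_j,\ \beta_j^- = 0 & \text{if } \beta_j > 0, \\ \beta_j^+ = 0,\ \beta_j^- = -\beta_j & \text{if } \beta_j < 0. \end{cases}$$

The $l_1$-constrained optimization problem (18)-(19) is then equivalent to

$$\min_{\beta^+, \beta^-} \sum_{i=1}^{n} C(y_i, h(x_i)'(\beta^+ - \beta^-)), \qquad (24)$$
$$\text{subject to} \quad \|\beta^+\|_1 + \|\beta^-\|_1 \le s, \quad \beta^+ \ge 0, \quad \beta^- \ge 0. \qquad (25)$$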





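The corresponding Lagrangian dual is

$$L = \sum_{i=1}^{n} C(y_i, h(x_i)'(\beta^+ - \beta^-)) + \lambda \sum_{j=1}^{J} (\beta_j^+ + \beta_j^-) \qquad (26)$$
$$\qquad - \lambda \cdot s - \sum_{j=1}^{J} \lambda_j^+ \beta_j^+ - \sum_{j=1}^{J} \lambda_j^- \beta_j^-, \qquad (27)$$

where $\lambda \ge 0$, $\lambda_j^+ \ge 0$, $\lambda_j^- \ge 0$ are Lagrange multipliers.
By differentiating the Lagrangian dual, we get that the solution of (24)-(25) needs to satisfy the
following Karush-Kuhn-Tucker conditions:

$$\frac{\partial L}{\partial \beta_j^+} = -h_j' \hat r + \lambda - \lambda_j^+ = 0, \qquad (28)$$
$$\frac{\partial L}{\partial \beta_j^-} = h_j' \hat r + \lambda - \lambda_j^- = 0, \qquad (29)$$
$$\lambda_j^+ \hat\beta_j^+ = 0, \qquad (30)$$
$$\lambda_j^- \hat\beta_j^- = 0. \qquad (31)$$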


Let $c_j = h_j' \hat r$ and $\mathcal{A} = \{j : |c_j| = \max_j |c_j|\}$. We can see the following facts from the
Karush-Kuhn-Tucker conditions:

(Fact 1) Using (28), (29) and $\lambda \ge 0$, $\lambda_j^+, \lambda_j^- \ge 0$, we have $|c_j| \le \lambda$.

(Fact 2) If $\hat\beta_j \ne 0$, then $|c_j| = \lambda$ and $j \in \mathcal{A}$. For example, suppose $\hat\beta_j^+ \ne 0$, then $\lambda_j^+ = 0$ and
(28) implies $c_j = \lambda$.

(Fact 3) If $\hat\beta_j \ne 0$, $\mathrm{sign}(\hat\beta_j) = \mathrm{sign}(c_j)$.

We also note that:

• $\hat\beta_j^+$ and $\hat\beta_j^-$ can not both be non-zero, otherwise $\lambda_j^+ = \lambda_j^- = 0$, and (28) and (29) can not hold at
the same time.

• It is possible that $\hat\beta_j = 0$ and $j \in \mathcal{A}$. This only happens for a finite number of $s$ values, where
basis $h_j$ is about to enter the model.
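Facts 1 to 3 can be checked numerically in a case where the constrained solution is known in closed form. The sketch below makes several assumptions: squared error loss $C(y, F) = (y - F)^2$, so that the generalized residual is $r = 2(y - F)$; a made-up dictionary with orthonormal columns, for which the $l_1$-constrained solution is a soft-thresholded version of $H'y$ (a standard lasso fact); and a threshold `gamma` fixed by hand, with $s$ taken to be the $l_1$ norm of the resulting coefficient vector. The helper `soft_threshold` is a hypothetical name.

```python
import math

def soft_threshold(z, gamma):
    """Lasso coefficients for an orthonormal design: shrink each z_j towards 0 by gamma."""
    return [math.copysign(max(abs(v) - gamma, 0.0), v) for v in z]

# Orthonormal toy dictionary (scaled Hadamard columns) and a made-up response.
H = [[0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0.5, 0.5, -0.5], [0.5, -0.5, -0.5]]
y = [1.35, 2.65, -0.35, 2.35]
n, J = len(H), len(H[0])

z = [sum(H[i][j] * y[i] for i in range(n)) for j in range(J)]          # H'y
beta_hat = soft_threshold(z, gamma=1.75)                               # solution at s = ||beta_hat||_1
fit = [sum(H[i][j] * beta_hat[j] for j in range(J)) for i in range(n)]
resid = [2.0 * (y[i] - fit[i]) for i in range(n)]                      # r = -dC/dF
corr = [sum(H[i][j] * resid[i] for i in range(n)) for j in range(J)]   # c_j = h_j' r
lam = max(abs(c) for c in corr)
active = [j for j in range(J) if abs(abs(corr[j]) - lam) < 1e-9]

print("beta_hat:", beta_hat)
print("correlations:", corr, "lambda:", lam, "active set:", active)
```

Here the two non-zero coefficients have correlations of equal absolute value $\lambda$ with signs matching the coefficients, while the third basis function has a strictly smaller correlation and stays out of the model.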


For sufficiently small $\Delta s$, since the second derivative of the cost function $C(y, F)$ is finite, $\mathcal{A}$ will
stay the same. Since $j \in \mathcal{A}$ if $\hat\beta_j \ne 0$, the change in the fitted vector is

$$\hat F(s + \Delta s) - \hat F(s) = \sum_{j \in \mathcal{A}} Q_j h_j.$$

Since $\mathrm{sign}(\hat\beta_j) = \mathrm{sign}(c_j)$ and the coefficients $\hat\beta_j$ change monotonically, $\mathrm{sign}(Q_j)$ will agree with
$\mathrm{sign}(c_j)$. Hence we have

$$\frac{\hat F(s + \Delta s) - \hat F(s)}{\Delta s} = \sum_{j \in \mathcal{A}} P_j\, \mathrm{sign}(c_j) h_j. \qquad (32)$$

This implies $u^*$ satisfies constraint 1. The claim that $v^*$ satisfies constraint 2 follows directly from fact
2, since both $\hat r(s + \Delta s)$ and $\hat r(s)$ satisfy constraint 2. ∎


Completion of proof of Theorem (1): We further notice that in both the -boosting case and the 
constrained optimization case, we have } jca P; = 1 by definition and the monotone coefficient 
paths condition, hence u and v are uniquely determined, i.e.: 


u=u“ and v= v*. 


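To translate the result into $\beta(s)$ and $\beta(T)$, we notice $F(x) = h(x)'\beta$. Efron et al. (2004) showed that for $\nabla\beta(s)$ to be well defined, $\mathcal{A}$ can have at most $n$ elements, i.e., $|\mathcal{A}| \le n$. We give sufficient conditions for when this is true in Appendix B.

Now let
$$H_{\mathcal{A}} = (\cdots\, h_j(x_i)\, \cdots), \quad i = 1, \ldots, n;\ j \in \mathcal{A}$$
be an $n \times |\mathcal{A}|$ matrix, which we assume is of rank $|\mathcal{A}|$. Then $\nabla\beta(s)$ is given by
$$\nabla\beta(s) = (H_{\mathcal{A}}' W H_{\mathcal{A}})^{-1} H_{\mathcal{A}}' W u^*,$$
and
$$\frac{\beta(T) - \beta(s)}{T \cdot \varepsilon} \to (H_{\mathcal{A}}' W H_{\mathcal{A}})^{-1} H_{\mathcal{A}}' W u.$$
Hence the theorem is proved. ∎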
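
To make the $\varepsilon$-boosting side of this comparison concrete, here is a minimal sketch of the usual forward-stagewise update: at each step, pick the dictionary function with the largest absolute correlation $|c_j|$ and move its coefficient by $\varepsilon \cdot \mathrm{sign}(c_j)$. It assumes squared error loss $C(y, F) = (y - F)^2/2$, so that $r = y - F$. The toy data, `eps`, `T`, and the function names are illustrative assumptions.

```python
# Sketch under stated assumptions: epsilon-boosting with squared-error loss.

def loss(residual):
    """Squared-error loss sum_i r_i^2 / 2 for a residual vector."""
    return 0.5 * sum(v * v for v in residual)

def eps_boost(H, y, eps, T):
    """Run T steps of epsilon-boosting; return coefficients and residual."""
    n, J = len(H), len(H[0])
    beta = [0.0] * J
    r = list(y)  # generalized residual r = y - F for squared-error loss
    for _ in range(T):
        c = [sum(H[i][j] * r[i] for i in range(n)) for j in range(J)]
        j = max(range(J), key=lambda k: abs(c[k]))  # most correlated basis function
        step = eps if c[j] > 0 else -eps
        beta[j] += step
        for i in range(n):
            r[i] -= H[i][j] * step
    return beta, r

# Hypothetical toy data: n = 4 observations, J = 3 dictionary functions.
H = [[1.0, 0.5, -1.0],
     [1.0, -0.5, 0.2],
     [-1.0, 1.0, 0.3],
     [0.3, 1.0, 1.0]]
y = [1.0, 2.0, -1.0, 0.5]
eps, T = 0.01, 120

beta_T, r_T = eps_boost(H, y, eps, T)
l1_norm = sum(abs(b) for b in beta_T)
print(beta_T, l1_norm, loss(r_T))
```

When every step increases the magnitude of the coefficient it touches, as the monotone coefficient paths condition requires, the $l_1$ norm after $T$ steps equals $T \cdot \varepsilon$, which is why $T \cdot \varepsilon$ plays the role of the constraint level $s$.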


Appendix B. Uniqueness and Existence Results 


In this appendix, we give some details on the properties of regularized solution paths. In Section B.1 we formulate and prove sparseness and uniqueness results on $l_1$-regularized solutions for any convex loss. In Section B.2 we extend Theorem 3 of Section 5, which proved the margin maximizing property of the limit of $l_p$-regularized solutions as regularization varies, to the case that the margin maximizing solution is not unique.


B.1 Sparseness and Uniqueness of $l_1$-Regularized Solutions and Their Limits


Consider the $l_1$-constrained optimization problem:
$$\min_{\|\beta\|_1 \le c} \sum_{i=1}^{n} C(y_i, \beta' h(x_i)). \qquad (33)$$
In this section we give sufficient conditions for the following properties of the solutions of (33):

1. Existence of a sparse solution (with at most $n$ non-zero coefficients),
2. Non-existence of non-sparse solutions with more than $n$ non-zero coefficients,
3. Uniqueness of the solution,
4. Convergence of the solutions to a sparse solution, as $c$ increases.


Theorem 3 Assume that the unconstrained solution for problem (33) has $l_1$ norm bigger than $c$. Then there exists a solution of (33) which has at most $n$ non-zero coefficients.


Proof As in Lemma 2 of Appendix A, we will prove the theorem using the Karush-Kuhn-Tucker (KKT) formulation of the optimization problem.

The chain rule for differentiation gives us that

$$\frac{\partial \sum_i C(y_i, \beta' h(x_i))}{\partial \beta_j} = -h_j' r(\beta), \qquad (34)$$

where $h_j$ and $r(\beta)$ are defined in Appendix A; $r(\beta)$ is the “generalized residual” vector. Using this simple relationship and Fact 2 of Lemma 2 we can write a system of equations for all non-zero coefficients at the optimal constrained solution as follows (denote by $\mathcal{A}$ the set of indices for non-zero coefficients):

$$H_{\mathcal{A}}' r(\beta) = \lambda \cdot \mathrm{sign}\,\beta_{\mathcal{A}}. \qquad (35)$$

In other words, we get $|\mathcal{A}|$ equations in $|\mathcal{A}|$ variables, corresponding to the non-zero $\beta_j$'s.

However, each column of the matrix $H_{\mathcal{A}}$ is of length $n$, and so $H_{\mathcal{A}}$ can have at most $n$ linearly independent columns, $\mathrm{rank}(H_{\mathcal{A}}) \le n$. Assume now that we have an optimal solution for (33) with $|\mathcal{A}| > n$. Then there exists $l \in \mathcal{A}$ such that

$$h_l = \sum_{j \in \mathcal{A},\, j \ne l} \alpha_j h_j. \qquad (36)$$

Substituting (36) into the $l$'th row in (35) we get

$$\Bigl(\sum_{j \in \mathcal{A},\, j \ne l} \alpha_j h_j\Bigr)' r(\beta) = \lambda \cdot \mathrm{sign}\,\beta_l. \qquad (37)$$

But from (35) we know that $h_j' r(\beta) = \lambda \cdot \mathrm{sign}\,\beta_j$, $\forall j \in \mathcal{A}$, meaning we can re-phrase (37) as

$$\sum_{j \in \mathcal{A},\, j \ne l} \alpha_j \cdot \mathrm{sign}\,\beta_j \cdot \mathrm{sign}\,\beta_l = 1. \qquad (38)$$

In other words, we get that $h_l$ is a linear combination of the columns of $H_{\mathcal{A} - \{l\}}$, which must obey the specific numeric relation in (38).

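Now we can construct an alternative optimal solution for (33) with one less non-zero coefficient, as follows:

1. Start from $\beta$.

2. Define the direction $\gamma$ in coefficient space implied by (36), that is:
$\gamma_l = -\mathrm{sign}\,\beta_l$, $\gamma_j = \alpha_j \cdot \mathrm{sign}\,\beta_l$, $\forall j \in \mathcal{A} - \{l\}$.

3. Move in direction $\gamma$ until some coefficient in $\mathcal{A}$ hits zero, i.e., define:
$$\delta^* = \min\{\delta > 0 : \exists j \in \mathcal{A} \text{ s.t. } \beta_j + \gamma_j \delta = 0\}$$
(we know that $\delta^* \le |\beta_l|$).

4. Set $\tilde\beta = \beta + \delta^* \gamma$.

Then from (36) we get that $\tilde\beta' h(x_i) = \beta' h(x_i)$, $\forall i$, and from (38) we get that

$$\|\tilde\beta\|_1 = \|\beta\|_1 + \sum_{j \in \mathcal{A}} \bigl[\,|\beta_j + \gamma_j \delta^*| - |\beta_j|\,\bigr] = \qquad (39)$$
$$= \|\beta\|_1 - \delta^* \cdot \Bigl(1 - \sum_{j \in \mathcal{A} - \{l\}} \alpha_j \cdot \mathrm{sign}\,\beta_j \cdot \mathrm{sign}\,\beta_l\Bigr) = \|\beta\|_1.$$

So $\tilde\beta$ generates the same fit as $\beta$ and has the same $l_1$ norm, therefore it is also an optimal solution, with at least one less non-zero coefficient (from the definition of $\delta^*$).

We can obviously apply this process repeatedly until we get a solution with at most $n$ non-zero coefficients. ∎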
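
The construction in steps 1-4 can be sketched in a few lines. The example below is a hedged illustration only: it takes $n = 2$ observations and three dictionary functions with $h_2 = \tfrac{1}{2} h_0 + \tfrac{1}{2} h_1$, and a coefficient vector with all-positive entries, so that relations (36) and (38) hold by construction rather than being derived from optimality. The dictionaries `H_cols`, `alpha`, `beta` and the function names are hypothetical.

```python
# Sketch under stated assumptions: one sparsification step from the proof.

def sign(x):
    return (x > 0) - (x < 0)

def fit(H_cols, beta):
    """Fitted vector sum_j beta_j * h_j, with each h_j stored as a list."""
    n = len(next(iter(H_cols.values())))
    return [sum(beta[j] * H_cols[j][i] for j in beta) for i in range(n)]

def sparsify_step(beta, alpha, l):
    """Move along gamma until a coefficient hits zero (steps 2-4)."""
    s_l = sign(beta[l])
    gamma = {j: alpha[j] * s_l for j in alpha}  # gamma_j = alpha_j * sign(beta_l)
    gamma[l] = -s_l                             # gamma_l = -sign(beta_l)
    # positive step sizes at which some coordinate reaches zero
    hits = [-beta[j] / gamma[j] for j in beta
            if gamma[j] != 0 and -beta[j] / gamma[j] > 0]
    delta_star = min(hits)
    new_beta = {j: beta[j] + delta_star * gamma[j] for j in beta}
    return new_beta, delta_star

# h_2 = 0.5 * h_0 + 0.5 * h_1, so (36) holds with l = 2.
H_cols = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}
alpha = {0: 0.5, 1: 0.5}
# All signs positive, so sum_j alpha_j * sign(beta_j) * sign(beta_l) = 1, i.e. (38).
beta = {0: 1.0, 1: 2.0, 2: 3.0}

new_beta, delta_star = sparsify_step(beta, alpha, l=2)
print(new_beta, delta_star)
print(fit(H_cols, beta), fit(H_cols, new_beta))
```

In this toy case the coefficient of $h_2$ is driven to zero while the fitted vector and the $l_1$ norm are unchanged, which is exactly the property the proof relies on.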


This theorem has the following immediate implication:

Corollary 4 If there is no set of more than $n$ dictionary functions which obeys the equalities (36,38) on the training data, then any solution of (33) has at most $n$ non-zero coefficients.


This corollary implies, for example, that if the basis functions come from a “continuous non- 
redundant” distribution (which means that any equality would hold with probability 0) then with 
probability 1 any solution of (33) has at most n non-zero coefficients. 


Theorem 5 Assume that there is no set of more than $n$ dictionary functions which obeys the equalities (36,38) on the training data. In addition assume:

1. The loss function $C$ is strictly convex (squared error loss, $C_l$ and $C_e$ obviously qualify),
2. No set of dictionary functions of size $\le n$ is linearly dependent on the training data.


Then the problem (33) has a unique solution. 


Proof The previous corollary tells us that any solution has at most $n$ non-zero coefficients. Now assume $\beta_1$, $\beta_2$ are both solutions of (33). From strict convexity of the loss we get that

$$h(X)'\beta_1 = h(X)'\beta_2 = h(X)'(\alpha\beta_1 + (1 - \alpha)\beta_2), \quad \forall\, 0 \le \alpha \le 1; \qquad (40)$$

and from convexity of the $l_1$ norm we get

$$\|\alpha\beta_1 + (1 - \alpha)\beta_2\|_1 \le \|\beta_1\|_1 = \|\beta_2\|_1 = c. \qquad (41)$$

So $\alpha\beta_1 + (1 - \alpha)\beta_2$ must also be a solution. Thus, the total number of variables with non-zero coefficients in either $\beta_1$ or $\beta_2$ cannot be bigger than $n$, since then $\alpha\beta_1 + (1 - \alpha)\beta_2$ would have more than $n$ non-zero coefficients for almost all values of $\alpha$, contradicting Corollary 4. Thus, by ignoring all coefficients which are 0 in both $\beta_1$ and $\beta_2$ we get that both $\beta_1$ and $\beta_2$ can be represented in the same (at most) $n$-dimensional sub-space of $\mathbb{R}^J$. This leads to a contradiction between (40) and assumption 2. ∎


Corollary 6 Consider a sequence $\{\hat\beta(c)/c : 0 < c < \infty\}$ of normalized solutions to the problem (33). Assume that all these solutions have at most $n$ non-zero coefficients. Then any limit point of the sequence has at most $n$ non-zero coefficients.

Proof This is a trivial consequence of convergence. Assume by contradiction that $\beta^*$ is a convergence point with more than $n$ non-zero coefficients. Let $k = \arg\min_j \{|\beta^*_j| : \beta^*_j \ne 0\}$. Then for any vector $\tilde\beta$ with at most $n$ non-zero coefficients we know that $\|\tilde\beta - \beta^*\| \ge |\beta^*_k| > 0$, so we get a contradiction to convergence. ∎

B.2 Uniqueness of Limiting Solution in Theorem 3 when Margin Maximizing Separator is not Unique

Recall that we are interested in convergence points of the normalized regularized solutions $\hat\beta^{(p)}(c)/c$. Theorem 3 proves that any such convergence point corresponds to an $l_p$-margin maximizing separating hyper-plane. We now extend it to the case that this first-order separator is not unique, by extending the result to consider the second smallest margin as a “tie breaker”. We show that any convergence point maximizes the second smallest margin among all models with maximal minimal margin. If there are also ties in the second smallest margin, then any limit point maximizes the third smallest margin among all models which still remain, and so on. It should be noted that the minimal margin is typically not attained by one observation only in margin maximizing models. In case of ties in the smallest margins our reference to “smallest”, “second smallest” etc. implies arbitrary tie-breaking (i.e., our decision on which one of the tied margins is considered smallest, and which one second smallest, is of no consequence).





Theorem 7 Assume that the data is separable and that the margin-maximizing separating hyper-plane, as defined in (4), is not unique. Then any convergence point of $\hat\beta^{(p)}(c)/c$ will correspond to a margin-maximizing separating hyper-plane which also maximizes the second smallest margin.


Proof The proof is essentially the same as that of Theorem 3. We outline it below.

From Theorem 3 we know that we only need to consider margin-maximizing models as limit points. Thus let $\beta_1$, $\beta_2$ be two margin maximizing models with $l_p$ norm 1, but let $\beta_1$ have a bigger second smallest margin. Assume that $\beta_1$ attains its smallest margin on observation $i_1$ and $\beta_2$ attains the same smallest margin on observation $i_2$. Now define

$$m_1 = \min_{i \ne i_1} y_i h(x_i)'\beta_1 > \min_{i \ne i_2} y_i h(x_i)'\beta_2 = m_2.$$

Then we have that Lemma 4 of Theorem 3 holds for $\beta_1$ and $\beta_2$ (the proof is exactly the same, except that we ignore the smallest margin observation for each model, since these always contribute the same amount to the combined loss).

Let $\beta^*$ be a convergence point. We know $\beta^*$ maximizes the margin from Theorem 3. Now assume $\tilde\beta$ also maximizes the margin but has bigger second-smallest margin than $\beta^*$. Then we can proceed exactly as in the proof of Theorem 3, considering only $n - 1$ observations for each model and using our modified Lemma 4, to conclude that $\beta^*$ cannot be a convergence point (again note that the smallest margin observation always contributes the same to the loss of both models). ∎


In the case that the two smallest margins still do not define a unique solution, we can continue 
up the list of margins, applying this result recursively. The conclusion is that the limit of the normalized,
$l_p$-regularized models “maximizes the margins”, and not just the minimal margin. The only
case when this convergence point is not unique is, therefore, the case that the whole order statistic of 
the optimal separator is not unique. It is an interesting research question to investigate under which 
conditions this scenario is possible. 
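
The recursive tie-breaking described above amounts to comparing the sorted margin vectors of candidate models lexicographically. The following minimal sketch shows that comparison for two hand-picked candidates with $l_1$ norm 1 on a three-point toy data set. The data, the candidate coefficient vectors, and the function names are illustrative assumptions, and the sketch only ranks the given candidates; it does not search for a margin-maximizing model.

```python
# Sketch under stated assumptions: rank candidate separators by sorted margins.

def sorted_margins(X, y, beta):
    """Margins y_i * h(x_i)'beta, sorted from smallest to largest."""
    return sorted(yi * sum(b * x for b, x in zip(beta, xi))
                  for xi, yi in zip(X, y))

# Hypothetical toy data; each row of X plays the role of h(x_i).
X = [[1.0, 1.0], [2.0, 4.0], [-1.0, -1.0]]
y = [1, 1, -1]

# Two hypothetical candidates, both with l_1 norm 1.
candidates = {"a": [1.0, 0.0], "b": [0.5, 0.5]}

profiles = {name: sorted_margins(X, y, b) for name, b in candidates.items()}
# Python lists compare lexicographically: smallest margin first,
# then the next smallest as tie breaker, and so on.
best = max(profiles, key=lambda name: profiles[name])
print(profiles, best)
```

Here both candidates share the same smallest margins, so the comparison is decided further up the ordered list, mirroring the argument that the limit “maximizes the margins” and not just the minimal one.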